2 Publisher Tasks

Site Studio Publisher allows you to define multiple tasks used in crawling and publishing the sites you maintain. These tasks are used to define how much of the site is crawled and how often the results are published.

Additionally, some default values of Site Studio Publisher, such as queue size, can be modified in the configuration file.

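This section covers the following topics:

  • Section 2.1, "About Publisher Tasks"

  • Section 2.2, "Creating and Editing Tasks"

  • Section 2.3, "Specifying Task Parameters"

  • Section 2.4, "Configuring Publisher"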

2.1 About Publisher Tasks

Tasks in Publisher are individual commands used to crawl a certain URL and then publish the gathered content. Each task can be made to crawl across an entire web site, or across just a few pages.

Each task crawls the web site (or portion of the web site) at the time marked in the task. Tasks can also be created to run only on demand.

2.2 Creating and Editing Tasks

Creating a task requires little more than a URL of the site to crawl. Tasks are easy to create and edit because of the simple interface.

2.3 Specifying Task Parameters

The parameters for a task are relatively straightforward. You need the URL for the web site and, for password-protected sites, a username and password, and you must set the days on which the task will crawl the site. Some of the parameters, such as triggers, are more complex.

The parameters are grouped into separate sections on the page.

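This section deals with the following topics:

  • Section 2.3.1, "Settings"

  • Section 2.3.2, "Dates"

  • Section 2.3.3, "Days"

  • Section 2.3.4, "Run Times"

  • Section 2.3.5, "Options"

  • Section 2.3.6, "Include List"

  • Section 2.3.7, "Exclude List"

  • Section 2.3.8, "Filter Sets"

  • Section 2.3.9, "Triggers"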

2.3.1 Settings

The Settings of a task include the most basic information about a Publisher task: the location of the site to crawl, any necessary login information, and which errors are critical.

The Settings section has advanced settings, which can easily be seen when you click the Show Advanced Settings link. The advanced settings expand and display in the same section.

Description

The name of the task is used to differentiate it from other tasks. It can also contain some information about how often the task runs, or what site it crawls, and so forth.

Manifest Url

This is the URL for Site Studio Publisher to crawl and then publish. The server, instance, and site must be filled in by the user.

Output Path

The local directory where the content is downloaded. Important state information is stored here as well. This is relative to the SSPHome location.

Username

The user ID, for password-protected sites.

Password

The password, for password-protected sites. The password is encrypted in the .hda file.

Authentication

The type of authentication used on the site, for password-protected sites.

  • LoginForm is used for Oracle Content Server 11g servers.

  • BasicAuth is used for Oracle Content Server 10g servers.

  • ExtranetLook is used for Oracle Content Server 10g servers with the ExtranetLook component.

  • NTLM is used for Oracle Content Server 10g servers configured to use NTLM.

  • CustomForm is used to customize a different form-based login.

Publish Now

Publish Now is a feature that marks pages with an implied high priority. As a contributor edits a Web page, they may mark a page by clicking "Publish Now" from Contributor. Publisher obtains a list of the marked pages from the server when it requests the list at the beginning of its operation.

Note:

The "Publish Now" feature must be enabled in Designer. See the Oracle WebCenter Content User's Guide for Site Studio Designer for more information.

When a crawl is executed, if the Publish Now control is set to true, Publisher only compares and publishes the marked, changed pages. If the altered content includes new links, Publisher also publishes those links. This crawl is suitable to run frequently.

Force Download

Typically, when Publisher examines links, it compares the current server state with its record of what it previously found. Publisher does not check whether the previously found asset is still in place on the local file system. Use the Force Download control to override this behavior when the output file system has missing or corrupt assets.

Force Analyze

The Force Analyze control is useful when you have implemented filters in your task specification. This control is only invoked during analysis, so use of this control forces the analysis pass on those files marked for examination on this crawl. If no controls are used to limit the examination operation, all files are re-analyzed with this setting.

This control is intended for use during development, and is most useful when filterset and filter controls are used within the task specification. Changing a filter usually affects some link rewriting behaviors. Normally, if a dynamic page has not changed, the filters are not re-applied, so the links on that page are not rewritten using the changed filters. Force Analyze overrides this and forces the analysis pass to run again.

Treat Home Page Errors As Critical

This control allows you to stop the task if errors are found while retrieving the home page. If the box is checked, then any error on the home page will abort the task.

Treat Manifest Errors as Critical

This control allows you to stop the task if there are errors in the URLs listed in the site manifest. If the box is checked, then any errors found retrieving the URLs in the site manifest will abort the task.

Use Cache Control

The Use Cache Control setting makes Publisher honor the Max Age (caching) section property, which controls the Cache-Control: max-age HTTP header. With this control enabled, Publisher selects and crawls only those pages whose max-age value has expired.

This allows parts of the site to be labeled as "rarely changed" while still being checked at controlled intervals. For example, you can have sections re-examined weekly, monthly, quarterly, or even annually, so that the web site is published in a timely manner without crawling every page on every run.

Use Last Modified

The Last Modified control is used to compare the http header Last-Modified returned from the web server. The web server normally returns this header for all resources accessed with a weblayout URL. (This response header is not provided for a dynamic Site Studio page.)

If Use Last Modified is checked, the Last-Modified value is re-submitted in an If-Modified-Since request header the next time that resource is retrieved. This allows the web server to return a 304 - Not Modified response if the resource is unchanged. If this control is not checked, the resource is downloaded and compared with the previously retrieved content.

You would only choose to uncheck this control if you found that your web server was returning unreliable results for Last-Modified.
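The conditional request described above can be sketched in Python. This is only an illustration of the HTTP exchange, not Publisher's implementation; the function names are hypothetical, and the stored Last-Modified value is assumed to come from a previous crawl.

```python
from typing import Dict, Optional


def conditional_headers(stored_last_modified: Optional[str]) -> Dict[str, str]:
    """Build request headers, echoing a stored Last-Modified value if one exists."""
    if stored_last_modified:
        # The value is sent back unchanged in If-Modified-Since.
        return {"If-Modified-Since": stored_last_modified}
    return {}


def is_unchanged(status_code: int) -> bool:
    """A 304 response means the previously retrieved copy is still current."""
    return status_code == 304
```

When no value has been stored (for example, on the first crawl), no conditional header is sent and the resource is downloaded in full.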

Use SSPETag

The SSPETag control is used to identify changes to files retrieved by the GET_FILE service. Use of this mechanism allows the content server to return a 304 - Not Modified response when applicable and so avoid unnecessary downloads.

Without this mechanism, file content is retrieved and compared with the previous version.

Default Filename

The default filename is the filename used for URLs that do not have an explicitly-specified filename.

Page Extension

The extension that is added to page urls that do not otherwise specify an extension. For example, a typical reference to a document in the dynamic site might look like:

http://myServer/mySite/Section1/DocumentX

Where DocumentX is the dDocName of the target item in Oracle Content Server. For use on a static site, you must add an extension.
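As a rough sketch of how Default Filename and Page Extension could combine when a URL is mapped to a static file, consider the following Python function. The mapping rules, the function name, and the values index.htm and .htm are assumptions made for illustration; they are not taken from Publisher.

```python
from posixpath import splitext
from urllib.parse import urlsplit


def local_path(url: str, default_filename: str = "index.htm",
               page_extension: str = ".htm") -> str:
    """Map a crawled URL to a static file path (illustrative rules only)."""
    path = urlsplit(url).path or "/"
    if path.endswith("/"):
        # No explicit filename: fall back to the Default Filename value.
        return path + default_filename
    _, ext = splitext(path)
    if not ext:
        # No extension: append the Page Extension value.
        return path + page_extension
    return path


print(local_path("http://myServer/mySite/Section1/DocumentX"))
# prints /mySite/Section1/DocumentX.htm
```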

User Agent

This control enables you to specify a value for the User-Agent http request header used by Site Studio Publisher when crawling the site.

Friendly Url Parameters

This control enables you to select a comma-separated list of additional parameter names to honor. Friendly Urls already honor parameters ending in NextRow or dcPageNum, and use the parameter name and value to construct the filename for the crawled page.

If a URL Parameter affects the appearance of the page, then you need to capture a different copy of the page for each combination of parameter values.

Dynamic Url Parameters

This control enables you to select a comma-separated list of additional parameter names to honor. Dynamic Urls using the GET_PAGE service already honor dID, dDocName, RevisionSelectionMethod, and Rendition.

If a URL Parameter affects the appearance of the page, then you need to capture a different copy of the page for each combination of parameter values.

Additional Services

This control enables you to select additional services to be crawled. There is built-in support for those services that are expected to generate meaningful static content: SS_GET_PAGE, GET_FILE, and GET_DYNAMIC_CONVERSION. Any services you enter are crawled in addition to those three. There is no way to prevent the crawl from using one of the three default services.

For more information on the Site Studio Services, see the Oracle WebCenter Content Technical Reference Guide for Site Studio.

Soft Error Threshold

This control determines the number of soft errors allowed. If this number is exceeded, publishing fails.

Hard Error Threshold

This control determines the number of hard errors allowed. If this number is exceeded, publishing fails.

Delete Threshold

This control determines the number of missing or deleted objects found while crawling. If this number is exceeded, publishing fails. Entering -1 in this field allows any number of missing objects.
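The three thresholds can be read as a simple pass/fail check, sketched below in Python. The function and parameter names are hypothetical, and the sketch assumes that "exceeded" means strictly greater than the configured value.

```python
def publishing_fails(soft_errors: int, hard_errors: int, deleted: int,
                     soft_threshold: int, hard_threshold: int,
                     delete_threshold: int) -> bool:
    """Return True if any error or delete count exceeds its threshold."""
    if soft_errors > soft_threshold or hard_errors > hard_threshold:
        return True
    # A Delete Threshold of -1 allows any number of missing objects.
    return delete_threshold != -1 and deleted > delete_threshold
```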

Soft Error Codes

This control enables you to specify which server error codes are treated as soft errors. No wildcards are allowed; each code must be entered explicitly in a comma-separated list.

Hard Error Codes

This control enables you to specify which server error codes are treated as hard errors. No wildcards are allowed; each code must be entered explicitly in a comma-separated list.

Ignore Error Codes

This control enables you to specify individual error codes that are ignored. These codes will not affect the Site Studio Publisher crawl. No wildcards are allowed; each code must be entered explicitly in a comma-separated list.
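The following Python sketch shows one way the three comma-separated code lists could be parsed and applied. The function names are hypothetical, and the precedence used when a code appears in more than one list (ignored, then hard, then soft) is an assumption; the source does not state how such a conflict is resolved.

```python
from typing import Set


def parse_codes(text: str) -> Set[int]:
    """Parse a comma-separated list of explicit codes, such as "404, 410"."""
    return {int(part) for part in text.split(",") if part.strip()}


def classify(code: int, soft: Set[int], hard: Set[int], ignore: Set[int]) -> str:
    """Classify a server error code (precedence order is assumed)."""
    if code in ignore:
        return "ignored"
    if code in hard:
        return "hard"
    if code in soft:
        return "soft"
    return "unlisted"
```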

2.3.2 Dates

The scheduling of a task starts with the Dates section. Here, you set the task either to run according to a schedule or to run only on demand.

If you set a task to run on demand by selecting Manual, the Days (see Section 2.3.3, "Days") and Run Times (see Section 2.3.4, "Run Times") sections will be unavailable.

Manual / Range

This control enables you to set if the task will be run only when selected, or if it will run at regular, scheduled intervals within the selected date range.

If you select Manual, then the Start Date and End Date fields, and the Days (see Section 2.3.3, "Days") and Run Times (see Section 2.3.4, "Run Times") sections will be made unavailable.

Start Date / End Date

This control enables you to set the start date and end date of the scheduling period for this task. There is a calendar control next to each date field.

2.3.3 Days

The Days section controls the scheduling of the task by day. Here you can set if the task is run on certain days of the week (such as every Monday and Thursday) or on certain days of the month (such as the 10th and 20th of the month).

The options in this section will be available only if Range was selected in the Dates section (see Section 2.3.2, "Dates").

Days of Week

This control enables you to specify which days of the week the task will run, regardless of what day of the month that day is.

Days of Month

This control enables you to specify which days of the month the task will run. The value can be a range, a comma-separated list of specific dates, or a combination. For instance, "1, 15", "10-20", and "1, 7-10, 15-19, 23" are all valid entries.

You can also use the word last in this field to denote the last day of the month. It can be used in a list, such as 15, last to run on the 15th of the month as well as the last day of the month.
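To make the Days of Month syntax concrete, here is a small Python sketch that expands such a value into the set of days for a given month. It is not Publisher's parser; the function name is hypothetical, and dropping days that do not exist in the month (such as 31 in April) is an assumption.

```python
import calendar
from typing import Set


def days_of_month(spec: str, year: int, month: int) -> Set[int]:
    """Expand a value such as "1, 7-10, last" into day numbers for one month."""
    last_day = calendar.monthrange(year, month)[1]
    days: Set[int] = set()
    for item in spec.split(","):
        item = item.strip().lower()
        if not item:
            continue
        if item == "last":
            days.add(last_day)
        elif "-" in item:
            start, end = (int(part) for part in item.split("-", 1))
            days.update(range(start, end + 1))
        else:
            days.add(int(item))
    # Assumption: days that do not exist in this month are skipped.
    return {day for day in days if 1 <= day <= last_day}


print(sorted(days_of_month("15, last", 2024, 2)))
# prints [15, 29]
```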

2.3.4 Run Times

The Run Times section is used to specify how often the task runs each day.

The options in this section will be available only if Range was selected in the Dates section (see Section 2.3.2, "Dates").

Multiple Times per Day / Once per Day

This control enables you to select how often the task runs each day, and during what times of the day.

If you select Multiple Times per Day, enter how often the task runs using the Hours and Minutes fields. You must also enter the period of the day during which the task runs.

If you select Once per Day, the task will run only at one time of day, which is entered in the box.
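For Multiple Times per Day, the resulting schedule can be pictured as a fixed interval repeated within a daily window. The Python sketch below lists the run times under that reading; the function name, the HH:MM format, and the inclusive end of the window are assumptions.

```python
from datetime import datetime, timedelta
from typing import List


def run_times(start: str, end: str, hours: int, minutes: int) -> List[str]:
    """List HH:MM run times from start to end (inclusive) at a fixed interval."""
    step = timedelta(hours=hours, minutes=minutes)
    if step <= timedelta(0):
        raise ValueError("interval must be positive")
    current = datetime.strptime(start, "%H:%M")
    stop = datetime.strptime(end, "%H:%M")
    times = []
    while current <= stop:
        times.append(current.strftime("%H:%M"))
        current += step
    return times


print(run_times("08:00", "10:00", 0, 45))
# prints ['08:00', '08:45', '09:30']
```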

2.3.5 Options

The Options section is used to set the priority and log levels of the task. You can also set an email address to receive notification after the task runs.

Priority

This control enables you to select whether the task runs at high priority. If unchecked, the task runs at normal priority.

Priority tasks run in a different queue than normal tasks. Most tasks are expected to run at normal priority, so the few tasks set to high priority can start sooner and finish sooner. Priority tasks typically check only part of a site, by using Publish Now or the Include List.

Log Level

This control enables you to set how much information is entered in the log.

Each item in the drop-down list includes the logging levels above it. For example, selecting INFO includes logs of not only INFO items but also WARN and ERROR.

Email Notification

This control enables you to enter an email address to which Publisher sends a message once the task has completed (or failed due to errors).

2.3.6 Include List

This control enables you to select individual site URLs which will be specifically included in the task. If a URL entered here is also listed in an exclude filter in the Filter set (see Section 2.3.8, "Filter Sets"), the filter set takes precedence, and the URL will not be included in publication.

Each time you click Add New Item, an additional field for entering a URL will display.

2.3.7 Exclude List

This control enables you to select individual site URLs which will be specifically excluded from the task.

Each time you click Add New Item, an additional field for entering a URL will display.

2.3.8 Filter Sets

A filter element describes a transformation that is applied to a downloaded content file. A filterset contains filter elements, and each filter is applied in the order the filter elements appear in the filterset. Several types of filter can be specified. Specification of a filter element depends on the kind of filter used, as specified by the type attribute.

Filtersets are used in different contexts: when applying filters to a downloaded file, for instance, or when deciding whether a given URL should be downloaded.

Each attribute of a filterset is compared to the relevant characteristics of the source content. Each and every attribute must match for the filterset itself to be considered a match. The filtersets are evaluated in the order in which they appear in the job description file, and the first filterset match is applied.

Type

This filterset attribute specifies how the filterset is applied to matching URLs. The types available are:

  • transform-content: for URLs that match this filterset, the child filter elements are applied during download, transforming the content in the manner specified by these filters.

  • transform-link: for URLs that match this filterset, the child filter elements are applied during download. In this case, the filter elements are only applied to links found in the current downloaded file, not to the entire content.

  • exclude: URLs that would normally be included but match this filterset are not downloaded.

Path

This filterset attribute specifies a pattern to match the file path of the URL (the part following the URL's host name) currently being downloaded. The pattern uses Java regular expression syntax.

Use the following reference for Java syntax:

http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html

Host Name

This filterset attribute is a value used to match the URL's host name.

Port

This filterset attribute is a value used to match the URL's port number.

Mime Type

This filterset attribute is a value used to match the URL's MIME type. A MIME type that starts with the entered filter value is considered a match; for example, text matches both text/html and text/xml.
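The matching rules above (every attribute must match, and the first matching filterset is applied) can be sketched in Python as follows. This is a simplified model, not Publisher's code: the class and function names are hypothetical, Python's re module stands in for Java regular expressions, unset attributes are assumed to match anything, and the path pattern is assumed to match the whole path.

```python
import re
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlsplit


@dataclass
class Filterset:
    kind: str                        # transform-content, transform-link, or exclude
    path: Optional[str] = None       # regular expression for the URL path
    host: Optional[str] = None
    port: Optional[int] = None
    mime_type: Optional[str] = None  # prefix match, for example "text"

    def matches(self, url: str, mime: str) -> bool:
        """Every attribute that is set must match for the filterset to match."""
        parts = urlsplit(url)
        if self.path is not None and not re.fullmatch(self.path, parts.path):
            return False
        if self.host is not None and self.host.lower() != (parts.hostname or ""):
            return False
        if self.port is not None and self.port != parts.port:
            return False
        if self.mime_type is not None and not mime.startswith(self.mime_type):
            return False
        return True


def first_match(filtersets: List[Filterset], url: str, mime: str) -> Optional[Filterset]:
    """Return the first filterset, in listed order, that matches the URL."""
    for filterset in filtersets:
        if filterset.matches(url, mime):
            return filterset
    return None
```

Because only the first match is applied, the order of the filtersets matters: a narrow exclude filterset listed after a broad transform filterset would never be reached for URLs that both match.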

Each filterset is a collection of at least one filter. These filters act in the order they are listed within the filterset, just as the filtersets act in the order they are listed.

The filters have the following attributes:

Regex

This control is the regular expression used to identify text to replace in the context of the file or link.

Replacement

This is the replacement value for the identified regex string.

Path Regex

This control is the regular expression used to identify a path and file to replace in the context of the current file. The Path Regex and Path Replacement are the best method of finding and renaming output files.

Using path regex is only meaningful within an enclosing transform-content filterset. Additionally, it does not use the Global and Ignore Case checkboxes. It is always case-sensitive, and replaces only the first match.

Path Replacement

This is the replacement value for the identified path regex string.

Global

This attribute is used to control the behavior of the substitution. If Global is checked, every match for the regex expression is replaced. If unchecked, only the first occurrence is replaced.

Ignore Case

This attribute is used to control the behavior of the regex expression. If Ignore Case is selected, then the case of the expression is ignored when comparing the regex value.
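The effect of the Global and Ignore Case checkboxes can be illustrated with Python's re module. This is only an analogy: Publisher uses Java regular expressions, whose replacement syntax differs in places (for example, group references), and the function name here is hypothetical.

```python
import re


def apply_filter(text: str, regex: str, replacement: str,
                 global_: bool = False, ignore_case: bool = False) -> str:
    """Replace the first match, or every match when global_ is set."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.sub treats count=0 as "replace all"; count=1 replaces only the first match.
    count = 0 if global_ else 1
    return re.sub(regex, replacement, text, count=count, flags=flags)


print(apply_filter("a.XML b.xml", r"\.xml", ".html"))
# prints a.XML b.html
print(apply_filter("a.XML b.xml", r"\.xml", ".html", global_=True, ignore_case=True))
# prints a.html b.html
```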

2.3.9 Triggers

Triggers enable the replication engine to run commands after the download is complete. A trigger is defined by a trigger element.

Username/Password

The username and password used for authentication. You cannot specify different authentications for each trigger. All triggers in a task must have the same credentials and authentication method.

Authentication

Select to authenticate the entered username and password either with basic authentication or with a custom form.

Type

The type control determines the type of command functionality.

A cmd trigger allows a shell script or an operating system command to be run as a trigger. The command is passed to the operating system.

The http-post trigger sends a command to the specified URL using HTTP_POST functionality.

The http-get trigger sends a command to the specified URL using HTTP_GET functionality.

The http-soap trigger sends a command, encoded as a SOAP package, to the specified URL using HTTP_POST functionality.

Command

This trigger control holds the command line to pass to the operating system if the type is set to cmd. For the http trigger types, the control holds a URL.

Examples of a command can be:

ping MyServer
cmd /C ren %%d\*.xml *.html

The first example (in Windows) would simply ping the MyServer server. The second is an example of a Windows command-line trigger that renames any file with the filename extension .xml to use the filename extension .html.

To run this trigger in the directory where content files have been downloaded, you must use %%d, followed by \ in a Windows environment or / in a UNIX environment, to represent the name of the download directory.

Command Data

This control is used to specify the path, relative to the task's output path, of the file to be uploaded.

This is valid only when the trigger type is http-post or http-soap.

For http-post, this is the application/x-www-form-urlencoded data to be submitted with the POST. Essentially, this is a list of name-value pairs separated by the & character. For example:

objectType=package&actionType=update&packageName=NAME

For http-soap, it is the location of a file containing the SOAP commands to be executed.
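If the http-post name-value pairs contain spaces or other reserved characters, they need to be URL-encoded. One way to produce a correctly encoded string for the Command Data field is Python's standard urlencode function, shown here with the field names from the example above; generating the value this way is a suggestion, not a Publisher requirement.

```python
from urllib.parse import urlencode

# Field names and values taken from the http-post example above.
fields = {"objectType": "package", "actionType": "update", "packageName": "NAME"}
command_data = urlencode(fields)
print(command_data)
# prints objectType=package&actionType=update&packageName=NAME

# Reserved characters are escaped; a space becomes "+".
print(urlencode({"packageName": "My Package"}))
# prints packageName=My+Package
```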

Response File

This control is used to specify a partial path to the response file relative to the Output Path for the task. The response file is used to capture all trigger responses.

SOAPAction

This control is used to set the value for the SOAPAction HTTP request header field.

This is valid only when the trigger type is set to http-soap.

Run the trigger even if the crawl failed

This control is used to determine if the trigger command should execute, even if there are other errors in the crawl.

Run the trigger if there were no changes

This control is used to specify if the trigger will run if there has been no change to the site.

Ignore trigger failure

This control is used to determine if subsequent triggers should be run, or if the task should simply stop. If checked, further triggers will run even if the trigger fails.

Log the response

This control is used to specify if the trigger response is written to the logfile.
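The three run-control checkboxes interact roughly as in the Python sketch below. It is a simplified model under stated assumptions: the names are hypothetical, triggers are assumed to run in listed order, and the two "run if" conditions are assumed to be checked independently of each other.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trigger:
    name: str
    run_if_crawl_failed: bool = False
    run_if_no_changes: bool = False
    ignore_failure: bool = False


def should_run(trigger: Trigger, crawl_failed: bool, site_changed: bool) -> bool:
    """Decide whether a trigger is eligible after the crawl."""
    if crawl_failed and not trigger.run_if_crawl_failed:
        return False
    if not site_changed and not trigger.run_if_no_changes:
        return False
    return True


def run_triggers(triggers: List[Trigger], crawl_failed: bool, site_changed: bool,
                 execute: Callable[[Trigger], bool]) -> List[str]:
    """Run eligible triggers in order; execute returns True on success."""
    ran = []
    for trigger in triggers:
        if not should_run(trigger, crawl_failed, site_changed):
            continue
        ran.append(trigger.name)
        if not execute(trigger) and not trigger.ignore_failure:
            break  # A failure that is not ignored stops the remaining triggers.
    return ran
```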

2.4 Configuring Publisher

There are some changes you can make to the default values of Publisher. The config.cfg file can be edited to allow different default actions and paths to be used.

Attribute: SSPHome=D:/SSPHome
Purpose: This option sets the "home" directory for Publisher's relative paths for output and auxiliary files.
Default: data/SiteStudioPublisher

Attribute: SSPLockFileTimeout=30
Purpose: This option controls the time that the Publisher output directory is locked. This value defines how old a file must be before it will be considered "stale" and be overridden.
Default: 30 (seconds)

Attribute: SSPOverrideStaleLockFile=true
Purpose: This option enables or disables the SSPLockFileTimeout mechanism.
Default: true (enabled)

Attribute: SSPDeleteLockFile=true
Purpose: This option determines if Publisher will delete the "lock" file at the end of a run.
Default: true

Attribute: SSPTimerTickInterval=5000
Purpose: This option determines, in milliseconds, the interval at which Publisher will examine the task schedule and queues to run eligible tasks and update status.
Default: 5000 (milliseconds)

Attribute: SSPViewLogPageLength=5000
Purpose: This option lets you control the number of lines displayed per page when viewing a log file. By default, Publisher breaks a log file into 5000-line chunks for viewing in the "View Log" page (see Section A.5, "View Log").
Default: 5000 (lines)

Attribute: SSPLogFileAgeLimit=7
Purpose: This option determines how long log files are stored. Publisher will keep log files spanning up to 7 days for each task. Old logs for a specific task are discarded whenever that task is run.
Default: 7 (days)

Attribute: SSPPriorityTaskThreads=1
Purpose: This option determines the number of tasks marked "Priority" that run at the same time. The sum of the Priority and Normal tasks should not exceed 4.
Default: 1

Attribute: SSPNormalTaskThreads=1
Purpose: This option determines the number of tasks not marked "Priority" that run at the same time. The sum of the Priority and Normal tasks should not exceed 4.
Default: 1

Attribute: SSPProxyHost
Purpose: This option lists the address of an http proxy server, if used. If set, this will override any JVM values used.
Default: No default value. (Example: proxy.example.com)

Attribute: SSPProxyPort
Purpose: This option lists the port used for the http proxy server, if used. If set, this will override any JVM values used.
Default: 80

Attribute: SSPNonProxyHosts
Purpose: This option lists any servers that should be connected to directly and not through the proxy listed in SSPProxyHost.
Default: No default value. (Example: MyServer|*.example.com)


The config.cfg file is accessed in the General Configuration section of the Admin Server, on the Additional Configuration Variables screen.
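Before saving changes, it can be useful to check the thread settings against the recommended limit of 4. The Python sketch below parses name=value lines and adds up the two thread counts; the function names are hypothetical, the treatment of lines starting with # as comments is an assumption, and the fallback of 1 reflects the defaults listed above.

```python
from typing import Dict


def parse_cfg(text: str) -> Dict[str, str]:
    """Parse name=value lines, skipping blank lines and # comments."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def thread_total(settings: Dict[str, str]) -> int:
    """Sum the Priority and Normal task threads (each defaults to 1)."""
    return (int(settings.get("SSPPriorityTaskThreads", 1))
            + int(settings.get("SSPNormalTaskThreads", 1)))


sample = "SSPHome=D:/SSPHome\nSSPPriorityTaskThreads=2\nSSPNormalTaskThreads=3\n"
cfg = parse_cfg(sample)
if thread_total(cfg) > 4:
    print("Warning: Priority plus Normal task threads exceed 4")
```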