Module name: screenscraper
The HTML Content Gear retrieves the contents of a Web location, given its URL, and renders them as gear content. The gear also rewrites all the URIs in the Web content so that the links shown in the gear content point to the original Web server.
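The URI rewriting described above can be pictured as resolving each link in the fetched page against the page's own URL. The following is only a minimal sketch using the standard java.net.URI class; the class name LinkRewriteSketch, the method toAbsolute, and the example.com addresses are hypothetical, and this is not the gear's actual implementation.

```java
import java.net.URI;

public class LinkRewriteSketch {

    // Resolves a link found in the fetched page against the page's own URL,
    // so that a relative reference becomes an absolute one on the original server.
    // (Hypothetical helper for illustration; not part of the gear's API.)
    public static String toAbsolute(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.example.com/news/index.html";
        // A relative link is resolved against the page's directory.
        System.out.println(toAbsolute(page, "images/logo.gif"));
        // A root-relative link is resolved against the server root.
        System.out.println(toAbsolute(page, "/about.html"));
        // A link that is already absolute is left unchanged.
        System.out.println(toAbsolute(page, "http://other.example.org/x.html"));
    }
}
```

Without this kind of resolution, a relative link such as images/logo.gif would be interpreted relative to the portal page that hosts the gear rather than relative to the source server.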
Instance Configuration
To configure an instance of the HTML Content Gear:
Enter the Community Administration and click the configure link for the gear instance.
In the configuration basics page, enter a gear name and description. Select sharing and Make visible to options and click Done.
Click the additional configurations link. The additional configurations page has two subpages, url settings and alerts.
url settings: Enter the URL of the content you want displayed in the gear. You can also include a URL and link text for users to display the content in a full page. You can change the resource bundle if you are localizing this gear instance for a language other than English. If you specify something other than the default, you must make sure the new resource bundle location is specified in the CLASSPATH.
alerts: Specify whether users can receive alerts from this gear.
Configuring the HtmlFilterParser
The behavior of the HTML Content Gear is also affected by a Nucleus component named /atg/portal/gear/screenscraper/HtmlFilterParser. This component has three properties that you may want to configure:
tagsToRemove
A list of tags that you want to remove from the source Web page so that they do not interfere with the rendering of the content into the gear’s content pages. For example, suppose the source content contains:
<title>This is the Title</title>
and you have specified title as one of the items in the tagsToRemove property. The above string is then rendered in the gear’s content pages as This is the Title, without the <title> tags.
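The effect of tagsToRemove can be approximated with a small regular-expression sketch. This is an illustration of the behavior only, under the assumption that matching is case-insensitive; the class TagsToRemoveSketch and its removeTags method are hypothetical and do not reflect how HtmlFilterParser is actually written.

```java
import java.util.regex.Pattern;

public class TagsToRemoveSketch {

    // Removes the start and end tags for each listed tag name,
    // but keeps whatever text appears between them.
    // (Hypothetical helper; the real parser may work differently.)
    public static String removeTags(String html, String... tags) {
        String result = html;
        for (String tag : tags) {
            // Matches <tag ...> and </tag>, ignoring case.
            String regex = "(?i)</?" + Pattern.quote(tag) + "\\b[^>]*>";
            result = result.replaceAll(regex, "");
        }
        return result;
    }

    public static void main(String[] args) {
        // prints "This is the Title"
        System.out.println(removeTags("<title>This is the Title</title>", "title"));
    }
}
```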
tagsToRemoveWithBody
A list of tags that you want to remove, together with the tags’ contents, from the source Web page. This property works like tagsToRemove, except that it removes not just the specified tags but also everything between the start and end tags.
For example, if this appears in the source content:
<title>This is the Title</title>
and you specified title as one of the items in the tagsToRemoveWithBody property, then the whole string is removed, including both the <title> tags and the This is the Title string.
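To contrast this with tagsToRemove, the sketch below drops each listed element together with its body. It is a simplified illustration only: the class TagsToRemoveWithBodySketch and its removeWithBody method are hypothetical, the regular expression does not handle nested occurrences of the same tag, and the actual parser may behave differently.

```java
import java.util.regex.Pattern;

public class TagsToRemoveWithBodySketch {

    // Removes each listed element entirely: start tag, body, and end tag.
    // (Hypothetical helper; does not handle nested elements of the same name.)
    public static String removeWithBody(String html, String... tags) {
        String result = html;
        for (String tag : tags) {
            String name = Pattern.quote(tag);
            // (?i) ignores case; (?s) lets '.' match line breaks inside the body.
            String regex = "(?is)<" + name + "\\b[^>]*>.*?</" + name + "\\s*>";
            result = result.replaceAll(regex, "");
        }
        return result;
    }

    public static void main(String[] args) {
        String source = "<title>This is the Title</title><p>Body text</p>";
        // prints "<p>Body text</p>"
        System.out.println(removeWithBody(source, "title"));
    }
}
```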
replaceBodyTagWithTableTag
The parser replaces the <body> tag with a <table> tag so that the bgcolor or background attributes of the source page’s <body> tag do not disrupt the community page where the gear is installed. You can turn off this behavior by setting the replaceBodyTagWithTableTag property to false.
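One way to picture the replacement is shown below. The text above does not say what the parser does with the attributes of the <body> tag or how it structures the resulting table, so this sketch makes two loudly stated assumptions: the attributes are carried over to the <table> tag so that they apply only inside the gear, and the content is wrapped in a single row and cell. The class BodyToTableSketch and its method are hypothetical.

```java
public class BodyToTableSketch {

    // Rewrites <body ...> ... </body> as a one-cell table.
    // ASSUMPTION: attributes such as bgcolor are copied onto the <table> tag,
    // and a single <tr><td> wraps the content. The real parser may differ.
    public static String replaceBodyWithTable(String html) {
        String result = html.replaceAll("(?i)<body\\b([^>]*)>", "<table$1><tr><td>");
        return result.replaceAll("(?i)</body\\s*>", "</td></tr></table>");
    }

    public static void main(String[] args) {
        String source = "<body bgcolor=\"#000000\">Hello</body>";
        System.out.println(replaceBodyWithTable(source));
    }
}
```

The point of the design is containment: a background set on a table affects only the area the gear occupies, whereas a second <body> tag with its own attributes could affect the whole community page.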
Extending the HtmlFilterParser
The Portal module includes the source code for the atg.portal.gear.screenscraper.HtmlFilterParser class in the <ATG11dir>/Portal/screenscraper/src/classes.jar/atg/portal/gear/screenscraper directory. You can modify the class to do your own custom parsing. You can also replace this parser with one of your own by subclassing atg.portal.gear.screenscraper.HtmlFilterParser and overriding the parse(Reader pIn, Writer pWriter) and parse(InputStream pIn, OutputStream pOut) methods. A subclass could, for example, replace other tags in the source page.
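The subclassing approach might look roughly like the sketch below. So that it compiles without the ATG libraries, it uses a stand-in class named BaseParser in place of atg.portal.gear.screenscraper.HtmlFilterParser; the stand-in, the void return type, the IOException declaration, and the HeadingDowngradeParser subclass are all assumptions made for illustration. Only the parse(Reader pIn, Writer pWriter) parameter list comes from the text above, and the stream-based overload is omitted for brevity.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class CustomParserSketch {

    // Stand-in for HtmlFilterParser (ASSUMPTION: real signature may differ).
    // It simply copies its input to its output.
    static class BaseParser {
        public void parse(Reader pIn, Writer pWriter) throws IOException {
            char[] buffer = new char[4096];
            int count;
            while ((count = pIn.read(buffer)) != -1) {
                pWriter.write(buffer, 0, count);
            }
        }
    }

    // Hypothetical subclass that replaces <h1> tags with <h3> tags
    // after the base parser has done its own filtering.
    static class HeadingDowngradeParser extends BaseParser {
        @Override
        public void parse(Reader pIn, Writer pWriter) throws IOException {
            StringWriter filtered = new StringWriter();
            super.parse(pIn, filtered);
            // $1 keeps the optional slash, so end tags are rewritten too.
            String rewritten = filtered.toString()
                    .replaceAll("(?i)<(/?)h1\\b[^>]*>", "<$1h3>");
            pWriter.write(rewritten);
        }
    }

    // Convenience wrapper that runs the subclass on a string.
    public static String run(String html) {
        try {
            StringWriter out = new StringWriter();
            new HeadingDowngradeParser().parse(new StringReader(html), out);
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run("<h1 class=\"x\">News</h1>"));
    }
}
```

Calling the superclass method first keeps the existing filtering in place, so the subclass only adds its own tag replacement on top of it.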