AquaLogic Interaction Administrator Guide


HTML Metadata Handling

In most cases, it is clear which source document attributes can be mapped to portal properties, but this might be less obvious for HTML documents.

This table shows the names of the attributes that are returned by the HTML Accessor. You can map the attribute names to portal properties.
Note: The HTML Accessor handles all common character sets used on the web, including UTF-8.
HTML Metadata | Name of Attribute Returned by HTML Accessor | Default Mapping or Mapping Suggestion
<TITLE> Tag | Title | Title (default)
<META> Tag | The attribute name is the NAME value.

Example:

<META NAME="creation_date" CONTENT="18-Jan-2004"> 
The attribute extracted from this example would be named “creation_date”.
Using the example, you could map creation_date to the Created property.
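
The <META> rule above can be illustrated with a short sketch. This is not the HTML Accessor's own code; it is a minimal Python approximation built on the standard library's html.parser module, and the names MetaAttributeParser and extract_meta_attributes are hypothetical.

```python
from html.parser import HTMLParser


class MetaAttributeParser(HTMLParser):
    """Hypothetical helper: collects NAME/CONTENT pairs from <META> tags."""

    def __init__(self):
        super().__init__()
        self.attributes = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names and attribute names,
        # so <META NAME=...> arrives here as tag "meta", attribute "name".
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name is not None and content is not None:
            # The attribute name is the NAME value; the value is CONTENT.
            self.attributes[name] = content


def extract_meta_attributes(html):
    """Return a dict of attribute name -> value for every named <META> tag."""
    parser = MetaAttributeParser()
    parser.feed(html)
    parser.close()
    return parser.attributes


meta_attributes = extract_meta_attributes(
    '<META NAME="creation_date" CONTENT="18-Jan-2004">'
)
print(meta_attributes)
```

In this sketch the creation_date key is what you would map to the Created property.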
Headline Tags | The attribute name is the name of the tag followed by a one-based ordinal index in parentheses.

The Accessor returns a value for each headline tag (<H1>, <H2>, <H3>, <H4>, <H5>, and <H6>) and each bold tag (<B>).

Example:
<H1>Value 1</H1>
<H3>Value 2</H3>
<H1>Value 3</H1>
<B>Value 4</B>
The HTML Accessor returns the following source document attribute-value pairs:
<h1>(1)		 Value 1
<h3>(1)		 Value 2
<h1>(2)		 Value 3
<B>(1)		 Value 4
If, on a particular news site, the second <H2> tag contains the name of the article and the third <B> tag contains the name of the author, you could map the portal property Title to <H2>(2) and the portal property Author to <B>(3).
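
The per-tag, one-based numbering can be sketched as follows. This is an illustration only, not the HTML Accessor's implementation: the HeadlineParser class is hypothetical, the attribute names are written in uppercase as an assumption, and tags nested inside another tracked tag are ignored for simplicity.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags the text says produce a value: <H1>..<H6> and <B>.
TRACKED_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "b"}


class HeadlineParser(HTMLParser):
    """Hypothetical helper: names each tracked tag as <TAG>(n), n one-based."""

    def __init__(self):
        super().__init__()
        self.pairs = []                   # (attribute name, value), document order
        self._counts = defaultdict(int)   # occurrences seen so far, per tag
        self._current = None              # tracked tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        # Assumption: a tracked tag nested inside another one is skipped.
        if tag in TRACKED_TAGS and self._current is None:
            self._current = tag
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self._counts[tag] += 1
            name = "<%s>(%d)" % (tag.upper(), self._counts[tag])
            self.pairs.append((name, "".join(self._text).strip()))
            self._current = None


def extract_headline_attributes(html):
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


sample = "<H1>Value 1</H1>\n<H3>Value 2</H3>\n<H1>Value 3</H1>\n<B>Value 4</B>"
headline_pairs = extract_headline_attributes(sample)
print(headline_pairs)
```

Each tag type keeps its own counter, which is why the second <H1> in the sample is numbered 2 even though an <H3> appears before it.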
HTML Comments | It is common practice to store metadata in HTML comments using the following format:
<!-- Writer: jm -->
<!-- AP: md -->
<!-- Copy editor: mr -->
<!-- Web editor: ad -->
In other words, the format is the HTML comment delimiter followed by the name, a colon, the value, and a close comment delimiter. The HTML Accessor parses data in this format and returns the following source document attribute-value pairs:
Writer jm
AP md
Copy editor mr
Web editor ad
Using the example, you could map Writer to the portal property Author.
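
The comment format described above (open delimiter, name, colon, value, close delimiter) can be approximated with a regular expression. The sketch below is an assumption about how such parsing might look, not the HTML Accessor's actual logic; COMMENT_METADATA and extract_comment_attributes are hypothetical names.

```python
import re

# "<!--", a name with no colon or ">", a colon, the value, then "-->".
# Whitespace around the name and the value is trimmed.
COMMENT_METADATA = re.compile(r"<!--\s*([^:>]+?)\s*:\s*(.*?)\s*-->", re.DOTALL)


def extract_comment_attributes(html):
    """Return a dict of name -> value for comments shaped like <!-- name: value -->."""
    return {name: value for name, value in COMMENT_METADATA.findall(html)}


sample = (
    "<!-- Writer: jm -->\n"
    "<!-- AP: md -->\n"
    "<!-- Copy editor: mr -->\n"
    "<!-- Web editor: ad -->\n"
)
comment_attributes = extract_comment_attributes(sample)
print(comment_attributes)
```

Comments that contain no colon do not match the pattern and are skipped in this sketch.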
Parent URL | Documents imported via a web content crawl return an attribute named Parent URL with the value of the URL of the parent page that contains a link to the document. | URL (default)
Anchors | The HTML Accessor provides special handling for internal anchors (<a name="target">) and URLs that reference them (http://server/page#target). You might map anchors to portal attributes in the following ways:
  • Alternate Sources for the portal Title attribute

    When the document URL for an HTML document contains a fragment identifier (such as #target in the example above) and the Accessor finds that anchor in the document, it discards all title and headline tags preceding the anchor and returns the first subsequent headline tag as the suggested document title. All subsequent tags are indexed relative to the anchor tag, so mapping a property to <H1>(2) means “use the second <H1> tag after the anchor tag named in the document URL.”

  • Mapping Anchor Section to Document Description or Summary

    The HTML Accessor returns an attribute named Anchor Section containing text immediately following the named anchor tag (stripped of markup tags and HTML decoded). Mapping this property to the document description allows the portal to generate a relevant description for each section of a large document.

    The HTML Accessor generates its own summary by returning the first summary-sized chunk of text in the document stripped of HTML markup tags and correctly HTML decoded. It returns this summary as an attribute named Summary.

    The Accessor executes the DocumentSummary method, which returns the value of the Anchor Section attribute, if available. If this attribute is not available, its second choice is the value of the Description attribute from the <META NAME="description"> tag. If this is not available, its third and final choice is the Summary attribute.
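
    The three-step preference order can be sketched as a simple fallback over the returned attributes. The function below is a hypothetical stand-in for illustration, not the product's DocumentSummary method; it assumes the attributes arrive as a dictionary keyed by attribute name and treats an empty value as not available.

```python
def document_summary(attributes):
    """Hypothetical stand-in for the fallback order described in the text.

    Preference: Anchor Section, then Description, then Summary.
    Assumption: a missing key or an empty string means "not available".
    """
    for name in ("Anchor Section", "Description", "Summary"):
        value = attributes.get(name)
        if value:
            return value
    return None


# Only Description and Summary are present, so Description is chosen.
chosen = document_summary(
    {"Description": "From the META tag", "Summary": "First chunk of text"}
)
print(chosen)
```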

