9 Improved Mapping HTML Markup from XML into Multi-line Text Fields
There has long been support for mapping limited HTML markup from XML into multi-line text (MLT) fields. In the simplest case, the markup designates paragraphs with the <p> tag and lists with the ordered-list <ol> or unordered-list <ul> tag and list-item <li> tags. Text modifiers such as bold <b>, italic <i>, and underline <u> were also supported.
Enhancements have been added to the markup support and some of these changes may impact your prior regression results.
- Recognition of paragraph tags is case insensitive.
The original implementation of this functionality years ago defined most tags in uppercase and checking for lowercase equivalent tags was inconsistent. For instance, a paragraph tag was expected to be <P> rather than <p> as is the present standard. During this review, tag identification has been validated to be case insensitive. This means that a legacy implementation may now result in different content if there were occurrences of lowercase tags that were previously being ignored. Recognition of these previously ignored paragraph tags could result in your content generating additional lines of spacing although the text content will likely remain consistent.
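When preparing for regression testing, it can help to find the lowercase paragraph tags that an older version might have ignored. The following is a minimal sketch, not part of the product; the function name and the pattern are hypothetical and only illustrate case-insensitive tag matching.

```python
import re

# Hypothetical pattern: an opening or closing paragraph tag in any letter case.
# The tag letter is captured so the caller can see which case was used.
PARAGRAPH_TAG = re.compile(r"</?(p)(?:\s[^>]*)?>", re.IGNORECASE)

def lowercase_paragraph_tags(markup):
    """Return the paragraph tags written in lowercase.

    These are the tags most likely to produce extra line spacing now that
    tag recognition is case insensitive.
    """
    return [m.group(0) for m in PARAGRAPH_TAG.finditer(markup) if m.group(1) == "p"]
```

Running this over existing extract data gives a quick list of content that is a candidate for regression differences.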
- The content of an MLT field should default to the font and color specified on the field.
Unless overridden by a specific attribute in the markup, the font and color of the text should correspond to that specified by the field. This was not happening in prior versions but will happen going forward. Alternatively, you can start including font or color modifier tags in the markup.
- The Font <font> tag is considered obsolete but still supported.
The font <font> tag with the attributes "face=" and "size=" is still supported, even though the font tag is considered obsolete starting in HTML5. However, one significant difference starting with this version is that the "size=" modifier no longer assumes the number that follows is a point size.
For example, in prior versions a tag <font size="2"> would have assumed you wanted a very small 2-pt font size. This is now more correctly identified as a "step down" from the default font size, which is assumed to be a 12-pt font. The HTML specification defined the font size= attribute as a range from 1 to 7, where 3 was assumed to be the default 12-point font size. This table indicates the size assumption.
| Size=”n” | Point Size |
|---|---|
| 1 | 8 |
| 2 | 10 |
| 3 | 12 |
| 4 | 14 |
| 5 | 16 |
| 6 | 20 |
| 7 | 24 |
The actual font size selection in your setup will depend upon the font cross-reference (FXR) file. For instance, if your current font family does not have a 10-point size defined in the FXR, the system may choose a size larger or smaller.
As noted, the font tag will still be recognized in the markup, but consider having new mappings use the supported <span> tag rather than <font>.
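To see how the new interpretation differs from the old one, the size table above can be expressed as a simple lookup. This is a sketch only: the function names are hypothetical, and clamping out-of-range values to 1 through 7 is an assumption borrowed from general HTML behavior, not a documented product rule.

```python
# Point sizes taken from the size table above.
FONT_SIZE_STEPS = {1: 8, 2: 10, 3: 12, 4: 14, 5: 16, 6: 20, 7: 24}

def font_size_attr_to_points(size_attr):
    """Interpret <font size="n"> as a step in the 1-7 range (current behavior)."""
    step = int(size_attr)
    step = max(1, min(7, step))  # assumption: clamp values outside 1-7
    return FONT_SIZE_STEPS[step]

def legacy_font_size_to_points(size_attr):
    """Interpret <font size="n"> as a literal point size (prior behavior)."""
    return int(size_attr)
```

For size="2", the prior interpretation yields 2 points while the current one yields 10 points, which is the kind of difference regression testing may reveal.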
- The <span> tag is now supported.
New to this version is support for the <span> tag. A span can be used to define attributes that change the font style, including the font-family and the font-size. If your markup previously included <span> tags, these were being ignored, but will now alter your output.
This is the preferred replacement for the <font> tag that is now obsolete starting with HTML5. Note that legacy support for the <font> tag will continue and has not been removed. For more information refer to the font tag section of this document.
Descriptions of the specific attributes supported in the <span> tag are listed separately in this document.
- Font-Style attribute supported for all paragraph-level tags (<p> <br> <li>) and the <span> tag.
In this version, the style= attribute modifier for paragraph or span tags will all parse and interpret the font-family, font-size, and color attributes in the same way. In addition, the modification of these attributes will be limited to the “scope” of the tag to which these are associated. This means that at the closing tag, the prior text styling attributes that were in existence before this tag began will be restored.
<p style="font-family: Arial; font-size: 14pt; color: #0000FF;">This text should be Arial 14pt and blue</p>
This paragraph tag specifies a font change for both the name of the font and the size. In addition, the text color has also been changed.
The style attribute is a quoted string that specifies one or more modifiers separated with a semicolon. The attributes may be specified in any order or omitted as necessary.
Attributes not included are assumed to be "inherited" from what was established before this tag began. For example, the tag <span style="color: green;"> changes the text color to green, but does not modify whatever font name or size had previously been defined. At the closing </span> tag, the color reverts to whatever color had been established before this span began.
Please note that this behavior of restoring the prior attributes at the close of the associated tag might yield a regression difference from legacy behavior.
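The scoping rule described above behaves like a stack: each opening tag pushes a copy of the current styling with its own overrides, and each closing tag pops back to the previous state. The sketch below illustrates that idea only. The class, the function, and the default values are hypothetical and do not represent the product's internal implementation.

```python
# Hypothetical field defaults, used only for this illustration.
FIELD_DEFAULTS = {"font-family": "Times", "font-size": "12pt", "color": "black"}

def parse_style(style):
    """Split 'name: value; name: value' into a dict; any order, any subset."""
    result = {}
    for part in style.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            result[name.strip().lower()] = value.strip()
    return result

class StyleScope:
    """Tracks the styling in effect as tags open and close."""

    def __init__(self, defaults):
        self._stack = [dict(defaults)]

    @property
    def current(self):
        return self._stack[-1]

    def open_tag(self, style=""):
        # Unspecified attributes are inherited from the enclosing scope.
        merged = dict(self.current)
        merged.update(parse_style(style))
        self._stack.append(merged)

    def close_tag(self):
        # Restore whatever was in effect before the tag began.
        if len(self._stack) > 1:
            self._stack.pop()
```

For example, opening a paragraph with Arial 14pt blue and then a span with only `color: green` leaves the font at Arial 14pt inside the span, and closing the span returns the color to blue.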
- Color attribute now supports color name keywords for all paragraph-level tags (<p> <br> <li>) and the <span> tag.
Legacy markup mappings that recognized color only supported the hexadecimal #rrggbb format. New to this version, support for standard HTML color names has been added. In general, you will probably use standard color names like red, green, blue, black, cyan, yellow, and magenta. The full list of supported names is extensive. An example list can be found here: https://www.w3schools.com/colors/colors_hex.asp
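If you need to compare old and new markup, it can be convenient to normalize color values to a single form. The sketch below uses the standard HTML hexadecimal values for a few common names. The function name is hypothetical, the name list is only a small subset, and treating names as case insensitive is an assumption based on HTML conventions.

```python
import re

# Standard HTML values for a few common color names (illustrative subset).
NAMED_COLORS = {
    "black": "#000000",
    "red": "#FF0000",
    "green": "#008000",
    "blue": "#0000FF",
    "cyan": "#00FFFF",
    "magenta": "#FF00FF",
    "yellow": "#FFFF00",
}

HEX_COLOR = re.compile(r"#[0-9a-fA-F]{6}")

def normalize_color(value):
    """Return the color as uppercase #RRGGBB, or None if it is not recognized."""
    text = value.strip()
    if HEX_COLOR.fullmatch(text):
        return text.upper()
    return NAMED_COLORS.get(text.lower())
```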
- Font-size attribute now supports measurements other than points (pt) for all paragraph-level tags (<p> <br> <li>) and the <span> tag.
Legacy interpretation of the font-size attribute assumed the value was a point size (pt) even if it specified a different unit of measurement. In addition to points (pt), this version introduces support for font-size specified in centimeters (cm), millimeters (mm), inches (in), pixels (px), picas (pi), and em (from em quadrat). This discussion does not explain each unit in detail, but note that sizes specified in inches or centimeters will likely require decimal values such as 0.12in or 0.5cm.
This new support for the measurements might be revealed in regression testing. For instance, if the markup had specified “font-size: 16px”, a prior version would have interpreted this as requesting a 16-point font size. With this new support, this request would result in a 12-point font selection as this is what 16 pixels indicates.
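The arithmetic behind these conversions can be sketched as follows. The factors are the usual typographic ones (72 points per inch, 12 points per pica, and 0.75 points per pixel, which matches the 16px to 12pt example above). The function name is hypothetical, and treating em as a multiple of the current font size is an assumption based on common HTML usage.

```python
import re

# Points per unit, using standard typographic conversions.
POINTS_PER_UNIT = {
    "pt": 1.0,
    "px": 0.75,         # 16px corresponds to 12pt
    "in": 72.0,
    "cm": 72.0 / 2.54,
    "mm": 72.0 / 25.4,
    "pi": 12.0,         # picas
}

SIZE_PATTERN = re.compile(r"\s*([0-9]*\.?[0-9]+)\s*([a-z]+)\s*", re.IGNORECASE)

def font_size_to_points(value, current_points=12.0):
    """Convert a font-size value such as '16px' or '0.5cm' to points."""
    match = SIZE_PATTERN.fullmatch(value)
    if match is None:
        raise ValueError("unrecognized font-size: %r" % value)
    number = float(match.group(1))
    unit = match.group(2).lower()
    if unit == "em":
        # Assumption: em is relative to the font size currently in effect.
        return number * current_points
    if unit not in POINTS_PER_UNIT:
        raise ValueError("unsupported unit: %r" % unit)
    return number * POINTS_PER_UNIT[unit]
```

With these factors, 0.12in works out to roughly 8.6 points and 0.5cm to roughly 14.2 points, which shows why inch and centimeter values usually need decimals.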
- Font-size attribute now supports keyword size description for all paragraph-level tags (<p> <br> <li>) and the <span> tag.
| Keyword | Point Size |
|---|---|
| xx-small | 4 |
| x-small | 6 |
| small | 10 |
| medium | 12 |
| large | 14 |
| x-large | 18 |
| xx-large | 24 |
The actual font size selection in your setup will depend upon the font cross-reference (FXR) file. For instance, if your current font family does not have a 6-point size defined in the FXR, the system may choose another adjoining size – smaller or larger.
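The interaction between a requested size and the sizes available in the FXR can be pictured with a small lookup. The keyword values come from the table above. The selection rule shown here (closest available size, preferring the smaller one on a tie) is purely an assumption for illustration; the actual choice made by the system may differ.

```python
# Keyword sizes taken from the table above.
KEYWORD_POINTS = {
    "xx-small": 4,
    "x-small": 6,
    "small": 10,
    "medium": 12,
    "large": 14,
    "x-large": 18,
    "xx-large": 24,
}

def nearest_available_size(requested, available):
    """Pick the available point size closest to the requested one.

    Assumption: ties are resolved in favor of the smaller size.
    """
    return min(available, key=lambda size: (abs(size - requested), size))

def resolve_keyword(keyword, available):
    """Map a font-size keyword to a point size that the font actually provides."""
    return nearest_available_size(KEYWORD_POINTS[keyword.lower()], available)
```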
- Support for HTML character entities has been added.
Character entities are escaped using the ampersand & followed by a keyword, decimal value, or hexadecimal value. Standard ASCII characters fall in the range from 32-255. Documaker supports Unicode characters up to decimal 65535 or hexadecimal 0xFFFF.
Any ASCII character may be represented in HTML using a decimal or hexadecimal entity. For instance, the capital letter A could be represented as &#65; or &#x41;. There is generally no reason to use a character entity for most standard text. However, because HTML (and XML) is often encoded as UTF-8, certain ASCII characters greater than code point 127 must be represented with a character entity. In addition, certain characters are interpreted to delineate structural components of HTML/XML. For instance, the less than < and greater than > symbols are used to delineate nodes in the content. The quote is used in XML and HTML to delineate attributes or other components. If these "control" characters are instead meant to be part of the text element, they need to be escaped using a character entity.
Note:
- Documaker does not support all possible HTML character entity names or representations. Only a subset of entities is supported. These will be described here.
- The HTML markup is provided via XML extract and certain HTML escapements may be misinterpreted by the XML parser. To ensure your markup is bypassed by the XML parser, you should escape your escapement.
Note:
&amp; &lt; and &gt; are all recognized by XML as character entities. If the text is meant to only apply to the HTML markup, your references should be depicted as: &amp;amp; &amp;lt; &amp;gt; in the XML.
The first &amp; will be recognized and converted by the XML parser to an ampersand without trying to interpret the entity meant for the HTML markup. When the content is later mapped to an MLT field, the HTML markup text references will be represented as expected.
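The two stages of escaping can be demonstrated outside the product with Python's standard library. In this sketch the XML parser handles the first level, and `html.unescape` stands in for the later markup interpretation. That stand-in is an assumption for illustration, since the product supports only a subset of entities. The element name `Note` and the helper names are hypothetical.

```python
import html
import xml.etree.ElementTree as ET

def markup_after_xml_parse(xml_text):
    """Stage 1: the XML parser resolves one level of escaping."""
    return ET.fromstring(xml_text).text

def xml_accepts(xml_text):
    """Return True if the XML parser can read the text at all."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Each HTML entity has its ampersand escaped once more for the XML layer.
xml_extract = "<Note>Smith &amp;amp; Sons &amp;copy; 2024, value &amp;lt; 10</Note>"

markup = markup_after_xml_parse(xml_extract)
# markup is now: Smith &amp; Sons &copy; 2024, value &lt; 10

rendered = html.unescape(markup)  # Stage 2: stand-in for the markup mapping
# rendered is now: Smith & Sons © 2024, value < 10
```

Note that an HTML-only entity such as `&copy;` written directly in the XML, without the extra escaping, is rejected by this XML parser as an undefined entity, which is one reason to escape the escapement.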
As noted earlier, any standard ASCII character can be represented as a decimal or hexadecimal value. Decimal representations start with the hash #, followed by up to three digits for standard ASCII code points or five digits for Unicode code points. Hexadecimal values also start with the hash, followed by the lowercase letter x and then the hex representation for that code point.
Examples:
The copyright symbol © can be represented as &#169; or &#xA9;
The greater than symbol > can be represented as &#62; or &#x3E;
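The decimal and hexadecimal forms follow directly from a character's code point, so they can be generated rather than looked up. The helper below is a hypothetical convenience for preparing extract data. The 65535 limit reflects the Unicode range stated earlier in this section.

```python
def numeric_entities(char):
    """Return the (decimal, hexadecimal) entity forms for a single character."""
    code = ord(char)
    if code > 0xFFFF:
        raise ValueError("code point %d is above the supported range" % code)
    return "&#%d;" % code, "&#x%X;" % code
```

For the capital letter A this yields `&#65;` and `&#x41;`.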
There are keyword designations for common character entities that may be easier to use and recognize.
| Character | Name | Decimal | Hex | Comment |
|---|---|---|---|---|
| " | &quot; | &#34; | &#x22; | Double quote |
| & | &amp; | &#38; | &#x26; | Ampersand |
| ' | &apos; | &#39; | &#x27; | Apostrophe |
| < | &lt; | &#60; | &#x3C; | Less than |
| > | &gt; | &#62; | &#x3E; | Greater than |
|   | &nbsp; | &#160; | &#xA0; | Hard space (non-breaking space) |
| ¢ | &cent; | &#162; | &#xA2; | Cent |
| £ | &pound; | &#163; | &#xA3; | Pound |
| ¤ | &curren; | &#164; | &#xA4; | Currency |
| ¥ | &yen; | &#165; | &#xA5; | Yen |
| ¦ | &brvbar; | &#166; | &#xA6; | Broken vertical bar |
| © | &copy; | &#169; | &#xA9; | Copyright |
| ® | &reg; | &#174; | &#xAE; | Registered trademark |
| ° | &deg; | &#176; | &#xB0; | Degree |
| ± | &plusmn; | &#177; | &#xB1; | Plus-Minus |
| ² | &sup2; | &#178; | &#xB2; | Superscript 2 |
| ³ | &sup3; | &#179; | &#xB3; | Superscript 3 |
| ´ | &acute; | &#180; | &#xB4; | Acute |
| ¶ | &para; | &#182; | &#xB6; | Paragraph mark |
| · | &middot; | &#183; | &#xB7; | Middle dot |
| ¹ | &sup1; | &#185; | &#xB9; | Superscript 1 |
| ¿ | &iquest; | &#191; | &#xBF; | Inverted question mark |
| ÷ | &divide; | &#247; | &#xF7; | Divide |
Any entity that is not recognized as a keyword or is formatted improperly will be left undisturbed in the content. If you see a representation that did not convert to the character you expected, be sure to check the formatting and/or spelling of the element.
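To preview how a piece of markup might convert, the behavior described above can be approximated with a small decoder that converts what it recognizes and leaves everything else untouched. This is a sketch under stated assumptions, not the product's parser: the function name is hypothetical, the keyword list is only a partial set of standard HTML names, and the exact rules for what counts as improperly formatted are assumed.

```python
import re

# Partial, illustrative set of standard HTML entity names and code points.
SUPPORTED_NAMES = {
    "quot": 34, "amp": 38, "apos": 39, "lt": 60, "gt": 62,
    "nbsp": 160, "copy": 169, "reg": 174, "deg": 176, "divide": 247,
}

# Keyword, decimal (up to 5 digits), or hexadecimal (up to 4 digits) form.
ENTITY = re.compile(r"&(#[0-9]{1,5}|#[xX][0-9a-fA-F]{1,4}|[A-Za-z][A-Za-z0-9]*);")

def decode_entities(text):
    """Convert recognized entities; leave unrecognized ones exactly as written."""
    def replace(match):
        body = match.group(1)
        if body[:2].lower() == "#x":
            code = int(body[2:], 16)
        elif body.startswith("#"):
            code = int(body[1:])
        else:
            code = SUPPORTED_NAMES.get(body)
        if code is None or not 0 < code <= 0xFFFF:
            return match.group(0)  # unknown or out of range: keep as-is
        return chr(code)
    return ENTITY.sub(replace, text)
```

A misspelled keyword such as `&copyright;` or an entity missing its closing semicolon passes through unchanged, which makes the problem easy to spot in the output.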