Implementation Issues

This chapter covers some issues specific to using the Export products.

Running in 24x7 Environments

To ensure robust 24x7 performance in server applications embedding the different export products, it is strongly recommended that the technology be run in a process separate from the server's primary process.

The file filtering technology underlying the Export products represents almost a quarter of a million lines of code. This code is expected to deal robustly with any stream of bytes, of any length (any file), in all cases. Oracle has dedicated, and continues to dedicate, significant effort to making this technology extremely robust. However, in real-world situations, expect that a small number of malformed files may force the filters into unstable states. This generally results in a memory exception (which can be trapped and recovered from gracefully), an infinite loop, or a wild pointer that causes the filter to write into memory that is part of the same process but does not belong to the filter. A wild pointer condition cannot be trapped.

On the desktop this is not a significant problem since the number of files being dealt with is relatively small. In a 24x7 server environment, however, a wild pointer can be extremely disruptive to the server process and produce serious problems. The best solution for dealing with this problem is to run any application that reads complex file formats in a separate process. This solution protects the application from the susceptibility of filtering technology to the unknown quality of input files.

It must be stressed that files that lead to wild pointers or infinite loops occur very infrequently, usually as a result of a third-party conversion process or beta versions of applications. Oracle is committed to addressing these issues and to updating and expanding its testing tools and corpus of documents to proactively minimize this "garbage in-garbage out" problem.
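The separate-process recommendation above can be sketched as follows. This is a minimal illustration, not part of the SDK: the function name run_isolated is hypothetical, and the child command stands in for whatever converter executable an application would actually launch. The sketch shows how a parent process can survive both a crashed child and a child stuck in an infinite loop.

```python
import subprocess
import sys


def run_isolated(command, timeout_seconds):
    """Run a conversion command in a child process and classify the outcome.

    Returns "ok", "failed" (non-zero exit, including a crash), or
    "timeout" (the child was killed after exceeding the time limit).
    """
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising, so a filter
        # stuck in an infinite loop cannot hang the server.
        return "timeout"
    # A crash in the child (for example from a wild pointer) shows up as
    # a non-zero or negative return code; the parent's memory is untouched.
    return "ok" if completed.returncode == 0 else "failed"


if __name__ == "__main__":
    # Stand-in for a real converter: a child Python process that exits cleanly.
    print(run_isolated([sys.executable, "-c", "pass"], timeout_seconds=10))
```

A server would typically log the "failed" and "timeout" cases along with the name of the offending input file, then continue with the next document.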

Running in Multiple Threads or Processes

On certain platforms, export products may be run in a multithreaded or multiprocessing application. When doing so, remember that each thread must go through all the steps listed in Chapter 1, "Introduction".

HTML Export Issues

The following implementation issues apply to HTML Export.

Relative URLs in Templates

Consider the following:

<html>
<body>
<p><img src="image.gif"></p>
{## insert element=sections.1.body}
</body></html>

In most reasonable implementations of HTML Export, the output files will probably be stored in a totally different location than the template files. In this scenario, the output files produced will have a reference to image.gif, which the browser will assume has the same path as the output files. However, image.gif is usually placed in the directory where the template file is located. This is a problem for anything referenced in the template using a relative URL. There are several possible solutions to this problem.

Guarantee the References Are Good

If the developer knows exactly which files all of the templates reference, the correct files (such as image.gif) can be moved to or located in the output directory(s). This solution requires the developer to have exact knowledge of the contents of the templates and may propagate the same set of files into many output locations.
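One way to gain that knowledge of a template's contents is to scan it for relative references before deployment. The following is a rough sketch using only the Python standard library; the class and function names are hypothetical, and it only inspects src and href attributes, so references made in other ways (for example from scripts or style sheets) would be missed.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class RelativeRefCollector(HTMLParser):
    """Collect src/href attribute values that are relative URLs."""

    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name not in ("src", "href") or not value:
                continue
            parts = urlsplit(value)
            # No scheme and no host means the browser resolves the value
            # against the location of the output file.
            if not parts.scheme and not parts.netloc:
                self.refs.append(value)


def relative_refs(template_text):
    """Return the relative src/href values found in a template."""
    collector = RelativeRefCollector()
    collector.feed(template_text)
    collector.close()
    return collector.refs
```

Each value returned names a file that must be present in, or reachable from, every output directory that uses the template.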

Use Absolute URLs

The developer can design templates to contain absolute URLs to any referenced files. The template fragment in the example would then look something like this:

<html>
<body>
<p><img src="http://www.oracle.com/templates/image.gif"></p>
{## insert element=sections.1.body}
</body>
</html>

This solution has the drawback that the output files are tied to a certain domain, requiring the developer to generate templates separately for each customer.

Generate Complete URLs Using {## insert oem=}

The developer can design templates to contain {## insert oem=} macros in front of each reference and use the callback this generates to fill in the complete URL. The template in the example would then look something like this.

<html>
<body>
<p><img src="{## insert oem=InsertURL}image.gif"></p>
{## insert element=sections.1.body}
</body>
</html>

The major drawback to this solution is the difficulty in generating templates like this using HTML editor tools such as HotMetalPro or FrontPage.

Use CGI and the <base> tag

At first glance, the base tag may seem an easy way out of this problem. The developer can simply add it to all the templates as follows:

<html>
<body>
<base href="http://www.outsideinsdk.com/templates/">
<p><img src="image.gif"></p>
{## insert element=sections.1.body}
</body>
</html>

However, in this solution the source file may contain graphics (such as embedded graphics in documents) that HTML Export must generate a separate GIF, JPEG, or PNG file to produce. This file is stored by default in the same directory as the initial output file, so the output file might look like this.

<html>
<body>
<base href="http://www.outsideinsdk.com/templates/">
<p><img src="image.gif"></p>
This is a document.
This is a document.
Below is a graphic.
<img src="output1.gif">
</body>
</html>

This output file will not work because the base tag makes the browser look for the file output1.gif in the http://www.outsideinsdk.com/templates/ directory, which is not in the same location as the output file.
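The effect of the base tag on the generated graphic can be reproduced with standard URL resolution. In this sketch the base href and image name come from the example output above, while the output file URL (www.example.com) is a made-up location used only for comparison.

```python
from urllib.parse import urljoin

# Values taken from the example output above.
BASE_HREF = "http://www.outsideinsdk.com/templates/"
GENERATED_IMAGE = "output1.gif"

# Hypothetical location of the output file, used only for comparison.
OUTPUT_FILE_URL = "http://www.example.com/output/output.htm"

# With a <base> tag, relative URLs resolve against the base href.
with_base = urljoin(BASE_HREF, GENERATED_IMAGE)

# Without one, they resolve against the document's own URL.
without_base = urljoin(OUTPUT_FILE_URL, GENERATED_IMAGE)

print(with_base)
print(without_base)
```

The first result points into the template directory, where the generated graphic does not exist; the second points next to the output file, where HTML Export stores it by default.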

Some applications may use HTML Export in such a way that all the output files are accessed through CGI or a CGI-like construct (NSAPI, ISAPI, Java Servlet, etc.). For instance, some developers may wish to use the EX_CALLBACK_ID_CREATENEWFILE callback to store all the output files in a database instead of a file system. If such redirection is already going on and the developer is not relying on the standard relative URL to absolute URL translation that takes place in the browser, then the base tag is irrelevant to the links generated by HTML Export and the whole thing will work. The output file in this case might be similar to this:

<html>
<body>
<base href="http://www.outsideinsdk.com/templates/">
<p><img src="image.gif"></p>
This is a document.
This is a document.
Below is a graphic.
<img src="/cgi-bin/mycgi.exe?125859458.htm">
</body>
</html>

Have HX copy the files using {## copy}

The developer can have the template copy files to the output directory by using the {## copy} macro. The example template would then be similar to this:

<html>
<body>
{## copy file=image.gif}
<p><img src="image.gif"></p>
{## insert element=sections.1.body}
</body>
</html>

The drawback to this solution is that a separate copy of each copied file will be placed in the output directory of every conversion that uses the template. These redundant copies waste disk space and increase conversion times.

Browser Caching

In the process of building and debugging templates, the developer is likely to run the same source file through HTML Export repeatedly with slightly different templates. Depending on how the developer names the output files, this tends to produce the same set of file names repeatedly. In this scenario, especially if the output is being read directly from a file system instead of a Web server, browsers tend to show the old cached results instead of the new ones. The rule of thumb is: "If it looks like bad output, press Refresh on every frame before deciding whether it's a problem with the template or the software." It may be simpler to empty the cache and turn off caching in your browser while creating and testing your templates.

Errors Returned by HTML Export

The errors that are returned by HTML Export are defined in the file common/sccerr.h. Errors may be added to this list or otherwise changed in future releases. To help minimize the impact of these changes, developers are encouraged to use the #defines for the errors rather than refer to errors by their numeric value.

CSS Considerations

The following information describes issues to consider when using Cascading Style Sheets.

Customizing CSS Styles

One of the most powerful features of Cascading Style Sheets is the ability to override the styles suggested in various ways. HTML Export has designed its CSS support to permit users to override the style sheets that it produces. This in turn allows the user to help blend documents from many authors into a collection that has a more unified look.

In order to override styles, one first needs to understand the style names that can appear in the HTML created by HTML Export, and where they are placed in the output. Styles can be overridden if new style definitions with names that match those generated by HTML Export are placed in the template files after the generated styles. See the documentation for the template elements pragma.cssfile or pragma.embeddedcss to understand how to control where generated styles will be placed in the HTML output.

Style Names Used by HTML Export

Style names are taken from the original style names in the source document. Unfortunately there is an inherent limitation in the style names the CSS standard permits. That standard only permits the characters [a-z][A-Z][0-9] and "-". Source document style names do not necessarily have this restriction. In fact they may even contain Unicode characters at times. For this reason, the original style names may need to be modified to conform to this standard. To avoid illegal style names, HTML Export performs the following substitutions on all source style names:

If the character is a "-", then it is replaced with "--".
If the character is not one of the remaining characters ([a-z][A-Z][0-9]), then it is replaced by "-xxxx" where "xxxx" is the Unicode value of the character in hexadecimal.
Otherwise the character appears in the style name normally.

The most common instance of this substitution is that spaces in style names are replaced with "-0020". For a more complete example, consider the source style name My Special H1-Style!. This would be transformed to My-0020Special-0020H1--Style-0021.
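The three substitution rules above can be sketched in Python as follows. The function name is hypothetical, and the use of lowercase hexadecimal with a fixed width of four digits is an assumption; the source only shows examples such as "-0020", which do not reveal how letters in the hex value are cased.

```python
import string

# Characters that pass through unchanged.
_SAFE = set(string.ascii_letters + string.digits)


def encode_style_name(source_name):
    """Apply the three substitution rules to a source style name."""
    pieces = []
    for ch in source_name:
        if ch == "-":
            pieces.append("--")
        elif ch in _SAFE:
            pieces.append(ch)
        else:
            # Four hex digits; assumes a code point no higher than U+FFFF.
            pieces.append("-{:04x}".format(ord(ch)))
    return "".join(pieces)


print(encode_style_name("My Special H1-Style!"))
```

Run on the style name from the example, this prints My-0020Special-0020H1--Style-0021.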

While admittedly this system lacks a certain aesthetic, it avoids the problem of how the document looks when the browser receives duplicate or invalid style names. Developers should also appreciate the simplicity of the code needed to parse or create these style names. Users who would prefer more human-readable style names should use the SCCOPT_EX_SIMPLESTYLENAMES option.
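As an illustration of how little code parsing requires, the following sketch reverses the substitution. The function name is hypothetical, and it assumes every escape is exactly four hexadecimal digits. It does not treat the "--Char" suffix specially, so a caller would need to strip that suffix first.

```python
def decode_style_name(css_name):
    """Reverse the substitution: "--" becomes "-", "-xxxx" becomes a character."""
    pieces = []
    i = 0
    while i < len(css_name):
        ch = css_name[i]
        if ch != "-":
            # Ordinary character, copied through unchanged.
            pieces.append(ch)
            i += 1
        elif css_name[i + 1:i + 2] == "-":
            # A doubled hyphen stands for one literal hyphen.
            pieces.append("-")
            i += 2
        else:
            # A single hyphen introduces a four-digit hex code point.
            hex_digits = css_name[i + 1:i + 5]
            if len(hex_digits) != 4:
                raise ValueError("truncated escape at position %d" % i)
            pieces.append(chr(int(hex_digits, 16)))
            i += 5
    return "".join(pieces)


print(decode_style_name("My-0020Special-0020H1--Style-0021"))
```

Because a hexadecimal digit can never be a hyphen, the two escape forms cannot be confused with each other.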

In addition, HTML Export sometimes creates special character attribute-only versions of styles. These have the same name as the style they are based on with "--Char" appended to the end. These styles differ from their original counterparts in that they contain no block level CSS. This more general solution replaces the solution implemented in versions 7.1 and earlier which created "--List" styles to solve a subset of this problem. This was done to work around limitations in some browsers.

Overriding HTML Export's Styles

Once style names are understood, it is possible to override the .css file produced by HTML Export. In the template used to export files, follow the reference to pragma.cssfile or pragma.embeddedcss with style definitions that match the names of those styles you wish to redefine. This is possible only if you are aware of the style names that will be found in the input document(s) to be exported.

Remember that many file formats allow styles to be based on other previously defined styles. HTML Export supports this by nesting styles. In this way each nested style inherits and may override items defined in the styles that surround it.

pragma.cssfile and {## link}

If an external .css file is being generated, one {## insert element=pragma.cssfile} statement should appear at the top of each HTML template file used for the export. It should be remembered that the {## link} statement may be used to trigger the creation of additional HTML files. As a result, each {## link}ed template will typically contain a <link> to the .css file generated.

It is possible, though, to {## link} to a template that does not have any {##} statements that would need to reference the .css file. In that case, the <link> to the .css file may safely be omitted. For example, consider a template that has only two {##} statements, both {## link}s (perhaps to put the results into two separate <frame>s). This template file would not need a <link> to the .css file.

Generally, only one .css file will be generated, regardless of how many HTML files are produced by HTML Export (although certain input file types, such as archives, result in output with several .css files). It is also worth repeating here that the <link> to the .css file must occur in the <head> of the document and each resulting HTML file may have only one <head>.

XML and HTML Export

In order for an XML parser to be able to read HTML, the HTML must be well formed. HTML Export can now produce HTML that can be parsed by an XML parser. See "XHTML and Well-Formed HTML" for more details on how to do this and what constitutes well-formed HTML.

While others may be willing to stretch the definition of what XML is, Oracle does not currently claim that HTML Export produces true XML. However, it is true that HTML Export produces HTML that can be parsed by an XML parser. Thus by using an appropriate template, the HTML produced by HTML Export may be wrapped in XML.

The Sample XML Template

To demonstrate how to wrap the output of HTML Export in XML, a sample XML template is included in the HTML Export SDK that produces XML from the output of HTML Export. When using the sample XML template there are some important things to keep in mind.

The output file name must have an .xml extension. This extension is not important to HTML Export. It is important to some browsers however.
The .xsl file must be in the same directory as the output .xml file. There is nothing special about the .xsl file and it does not affect HTML Export in any way.
As of this writing, Microsoft Internet Explorer 5.0 is the only major browser that is capable of rendering the resulting .xml file.

XHTML and Well-Formed HTML

HTML Export is able to produce output that is XHTML compliant and well formed. In order to have this happen, the SCCOPT_EX_COMPLIANCEFLAGS option must have either the SCCEX_CFLAG_STRICT_DTD or the SCCEX_CFLAG_WELLFORMED flag set. Further discussion of XHTML and well-formed HTML in this chapter assumes that one of these flags has been used.

The XHTML 1.0 W3C recommendation lists three types of XHTML compliance: Transitional, Frameset, and Strict. HTML Export is compliant with both XHTML Transitional and XHTML Frameset. When using HTML Export to produce XHTML it is important to remember that the template being used must be XHTML compliant.

The output of HTML Export has been tested to ensure that it is well formed when one of the proper flags is set in the SCCOPT_EX_COMPLIANCEFLAGS option. This is meaningless, however, unless the template used by HTML Export is also well formed. To assist with creating well-formed templates, here is a list of common problems that cause documents to not be well formed:

All tags must be properly nested.
All tags that are opened must also be closed. This includes tags that are not normally thought of as needing closing tags, including <meta>, <link>, <frame>, <hr> and <br> tags.
Everything after an equals sign must be in double quotes. So <font color="0000ff"> is OK, but <font color=0000ff> is not.
In order for &nbsp; to appear in a document, a <!DOCTYPE> statement must be in the HTML. Since HTML Export cannot know if the template included the <!DOCTYPE> statement, when the SCCEX_CFLAG_STRICT_DTD flag is set, &#160; is always used instead of &nbsp;.
Characters in the range 0x80 - 0xFF are to be written in the form &#xxx;.
The only three character codes less than 0x20 allowed in a document are \t, \n and \r.
All attributes of a tag must be followed by "=value." Thus the "nowrap" in <table nowrap> is not well formed. HTML Export uses <table nowrap="nowrap"> instead.
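Several of the problems listed above can be caught before deployment by running template fragments through any XML parser. The sketch below uses the parser in the Python standard library. The function name is hypothetical, and this is a generic XML well-formedness check, not a tool supplied with HTML Export.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, else False."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


samples = [
    '<table nowrap="nowrap"></table>',   # attribute has a value
    '<table nowrap></table>',            # attribute without "=value"
    '<p>line one<br></p>',               # <br> is never closed
    '<p>line one<br /></p>',             # <br> closed in place
    '<font color=0000ff>text</font>',    # unquoted attribute value
    '<p>&nbsp;</p>',                     # entity undefined without a DTD
    '<p>&#160;</p>',                     # numeric reference needs no DTD
]
for sample in samples:
    print(is_well_formed(sample), sample)
```

A complete template can be checked the same way, provided its {##} macros appear only in text or inside quoted attribute values, since the parser treats them as ordinary characters.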

Archive Support

The following information pertains to archive support in HTML Export.

Using Redirected IO with Archive Files

When using redirected IO with input archive files, the OEM must be sure to fully support the IOGetInfo call. It is used by HTML Export to obtain the name of the archive. To that string, HTML Export appends the ItemNum value for use as a default value when creating the reflink template element. HTML Export also executes a call to IOGetInfo to implement pragma.sourcefilename.

Temporary File Creation

Whenever HTML Export needs to access data in a document in an archive file, it extracts the contents of that archived file to a temporary file on the disk. Users should be aware that this might pose a security threat if someone has access to the disk of the machine running HTML Export. This is an issue even when using redirected IO.

Users of redirected IO should also be aware that the pSpec/dwSpecType are set to the values for the temporary files. As a result, redirected IO is cut out of the picture and the redirected IO "file" is closed.

Temporary files are created in two cases. The first is when DAOpenDocument is called on an entry in an archive. The second is when the following code is used to extract and convert a file in the archive file:

{## link element=sections.current.decompressedfile}

Please see the Options Guide for more information about Temporary Files.

Empty Directories in Archive Files

Entries for directories that do not contain files are allowed. Such entries will be considered to have ItemNums, but not section numbers. Thus, when looping (unsorted) through sections in an archive, there may be gaps in the ItemNums seen. These gaps correspond to directories that do not contain files.

Finding the Total Number of Files in an Archive

In order to determine the total number of files in an archive, write a template that retrieves number=sections.count. The number of sections is equal to the number of files in the archive, not the number of entries in the archive file.

Positional Frames Support

HTML Export uses DHTML to position objects with greater accuracy. However, only two types of object positioning are supported: paragraph anchored objects and page anchored objects. The following are notes about this initial support for positional frames:

HTML Export generates paragraph objects separately from page objects, even if it appears that they should be placed in the same location.
Transparency is not supported when separate graphics items are placed on top of one another. The SCCOPT_EX_PREVENTGRAPHICOVERLAP option does not apply to these graphics. The graphics will appear relative to where the anchor point is, not relative to the text in the document. Additionally, HTML Export does not support certain graphics effects, such as rotation or stretching.
The SCCOPT_EX_GRAPHICOUTPUTDPI option must be set properly to achieve best results.
In some cases, HTML Export will produce output with inaccurately placed objects when the input document features positional frame objects. We are implementing this feature despite these occasional errors, as this end result is no worse than the end result when handling positional frame objects in earlier versions of HTML Export (the graphics would be placed in a long column).
This feature only works in the 4.0 versions of HTML.

Limitations of Multimedia File Support

Support for the multimedia file type is rather limited at this time. Currently, only one filter uses it (the id3 filter), which only supports MP3 files. From these files, only text properties may be extracted. The named properties are:

property.title
property.album
property.artist

All other properties must be accessed via the property.all or property.others macros. Since only text properties are supported, no embeddings (album cover graphics) are available.

At this time, the body and title parts of these files are not supported. An example of the unsupported body content would be the actual musical content of an MP3 file. While title is not supported, property.title is, provided the information is present in the source document. If a template attempts to insert sections.x.body, sections.x.title, or any of their aliases or sub nodes, nothing will be inserted.