FAQ
History
PreviousHomeNext Search
Feedback
Divider

Generating XML Data

This section also takes you step by step through the process of constructing an XML document. Along the way, you'll gain experience with the XML components you'll typically use to create your data structures.

Writing a Simple XML File

You'll start by writing the kind of XML data you could use for a slide presentation. In this exercise, you'll use your text editor to create the data in order to become comfortable with the basic format of an XML file. You'll be using this file and extending it in later exercises.

Creating the File

Using a standard text editor, create a file called slideSample.xml.


Note: Here is a version of it that already exists: slideSample01.xml. (The browsable version is slideSample01-xml.html.) You can use this version to compare your work, or just review it as you read this guide.


Writing the Declaration

Next, write the declaration, which identifies the file as an XML document. The declaration starts with the characters "<?", which is the standard XML identifier for a processing instruction. (You'll see other processing instructions later on in this tutorial.)

  <?xml version='1.0' encoding='utf-8'?>  

This line identifies the document as an XML document that conforms to version 1.0 of the XML specification, and says that it uses the 8-bit Unicode character-encoding scheme. (For information on encoding schemes, see Java Encoding Schemes.)

Since the document has not been specified as "standalone", the parser assumes that it may contain references to other documents. To see how to specify a document as "standalone", see The XML Prolog.

Adding a Comment

Comments are ignored by XML parsers. A program will never see them in fact, unless you activate special settings in the parser. Add the text highlighted below to put a comment into the file.

<?xml version='1.0' encoding='utf-8'?> 

<!-- A SAMPLE set of slides -->  

After the declaration, every XML file defines exactly one element, known as the root element. Any other elements in the file are contained within that element. Enter the text highlighted below to define the root element for this file, slideshow:

<?xml version='1.0' encoding='utf-8'?> 

<!-- A SAMPLE set of slides --> 

<slideshow> 

</slideshow> 

Note: XML element names are case-sensitive. The end-tag must exactly match the start-tag.


Adding Attributes to an Element

A slide presentation has a number of associated data items, none of which require any structure. So it is natural to define them as attributes of the slideshow element. Add the text highlighted below to set up some attributes:

...
  <slideshow 
    title="Sample Slide Show"
    date="Date of publication"
    author="Yours Truly"
    >
  </slideshow> 

When you create a name for a tag or an attribute, you can use hyphens ("-"), underscores ("_"), colons (":"), and periods (".") in addition to characters and numbers. Unlike HTML, values for XML attributes are always in quotation marks, and multiple attributes are never separated by commas.


Note: Colons should be used with care or avoided altogether, because they are used when defining the namespace for an XML document.


Adding Nested Elements

XML allows for hierarchically structured data, which means that an element can contain other elements. Add the text highlighted below to define a slide element and a title element contained within it:

<slideshow 
  ...
  >

   <!-- TITLE SLIDE -->
  <slide type="all">
    <title>Wake up to WonderWidgets!</title>
  </slide>

</slideshow> 

Here you have also added a type attribute to the slide. The idea of this attribute is that slides could be earmarked for a mostly technical or mostly executive audience with type="tech" or type="exec", or identified as suitable for both with type="all".

More importantly, though, this example illustrates the difference between things that are more usefully defined as elements (the title element) and things that are more suitable as attributes (the type attribute). The visibility heuristic is primarily at work here. The title is something the audience will see. So it is an element. The type, on the other hand, is something that never gets presented, so it is an attribute. Another way to think about that distinction is that an element is a container, like a bottle. The type is a characteristic of the container (is it tall or short, wide or narrow). The title is a characteristic of the contents (water, milk, or tea). These are not hard and fast rules, of course, but they can help when you design your own XML structures.

Adding HTML-Style Text

Since XML lets you define any tags you want, it makes sense to define a set of tags that look like HTML. The XHTML standard does exactly that, in fact. You'll see more about that towards the end of the SAX tutorial. For now, type the text highlighted below to define a slide with a couple of list item entries that use an HTML-style <em> tag for emphasis (usually rendered as italicized text):

  ...
  <!-- TITLE SLIDE -->
  <slide type="all">
    <title>Wake up to WonderWidgets!</title>
  </slide>

  <!-- OVERVIEW -->
  <slide type="all">
    <title>Overview</title>
      <item>Why <em>WonderWidgets</em> are great</item>
      <item>Who <em>buys</em> WonderWidgets</item>
  </slide>

</slideshow> 

Note that defining a title element conflicts with the XHTML element that uses the same name. We'll discuss the mechanism that produces the conflict (the DTD), along with possible solutions, later on in this tutorial.

Adding an Empty Element

One major difference between HTML and XML, though, is that all XML must be well-formed -- which means that every tag must have an ending tag or be an empty tag. You're getting pretty comfortable with ending tags, by now. Add the text highlighted below to define an empty list item element with no contents:

  ...
  <!-- OVERVIEW -->
  <slide type="all">
    <title>Overview</title>
    <item>Why <em>WonderWidgets</em> are great</item>
    <item/>
    <item>Who <em>buys</em> WonderWidgets</item>
  </slide>

</slideshow> 

Note that any element can be empty element. All it takes is ending the tag with "/>" instead of ">". You could do the same thing by entering <item></item>, which is equivalent.


Note: Another factor that makes an XML file well-formed is proper nesting. So <b><i>some_text</i></b> is well-formed, because the <i>...</i> sequence is completely nested within the <b>..</b> tag. This sequence, on the other hand, is not well-formed: <b><i>some_text</b></i>.


The Finished Product

Here is the completed version of the XML file:

<?xml version='1.0' encoding='utf-8'?>

<!--  A SAMPLE set of slides  --> 
<slideshow 
  title="Sample Slide Show"
  date="Date of publication"
  author="Yours Truly"
  >

  <!-- TITLE SLIDE -->
  <slide type="all">
    <title>Wake up to WonderWidgets!</title>
  </slide>

  <!-- OVERVIEW -->
  <slide type="all">
    <title>Overview</title>
    <item>Why <em>WonderWidgets</em> are great</item>
    <item/>
    <item>Who <em>buys</em> WonderWidgets</item>
  </slide
</slideshow> 

Save a copy of this file as slideSample01.xml, so you can use it as the initial data structure when experimenting with XML programming operations.

Writing Processing Instructions

It sometimes makes sense to code application-specific processing instructions in the XML data. In this exercise, you'll add a processing instruction to your slideSample.xml file.


Note: The file you'll create in this section is slideSample02.xml. (The browsable version is slideSample02-xml.html.)


As you saw in Processing Instructions, the format for a processing instruction is <?target data?>, where "target" is the target application that is expected to do the processing, and "data" is the instruction or information for it to process. Add the text highlighted below to add a processing instruction for a mythical slide presentation program that will query the user to find out which slides to display (technical, executive-level, or all):

<slideshow 
  ...
  > 
  <!-- PROCESSING INSTRUCTION -->
  <?my.presentation.Program QUERY="exec, tech, all"?> 
  <!-- TITLE SLIDE --> 

Notes:

The colon makes the target name into a kind of "label" that identifies the intended recipient of the instruction. However, while the w3c spec allows ":" in a target name, some versions of IE5 consider it an error. For this tutorial, then, we avoid using a colon in the target name.

Save a copy of this file as slideSample02.xml, so you can use it when experimenting with processing instructions.

Introducing an Error

The parser can generate one of three kinds of errors: fatal error, error, and warning. In this exercise, you'll make a simple modification to the XML file to introduce a fatal error. Then you'll see how it's handled in the Echo app.


Note: The XML structure you'll create in this exercise is in slideSampleBad1.xml. (The browsable version is slideSampleBad1-xml.html.)


One easy way to introduce a fatal error is to remove the final "/" from the empty item element to create a tag that does not have a corresponding end tag. That constitutes a fatal error, because all XML documents must, by definition, be well formed. Do the following:

  1. Copy slideSample02.xml to slideSampleBad1.xml.
  2. Edit slideSampleBad1.xml and remove the character shown below:
  3. ...
    <!-- OVERVIEW -->
    <slide type="all">
      <title>Overview</title>
      <item>Why <em>WonderWidgets</em> are great</item>
      <item/>
      <item>Who <em>buys</em> WonderWidgets</item>
    </slide>
    ... 
    

to produce:

...
<item>Why <em>WonderWidgets</em> are great</item>
<item>
<item>Who <em>buys</em> WonderWidgets</item> 
... 
Now you have a file that you can use to generate an error in any 
parser, any time. (XML parsers are required to generate a fatal 
error for this file, because the lack of an end-tag for the 
<item> element means that the XML structure is no longer well-
formed.) 

Substituting and Inserting Text

In this section, you'll learn about:

Handling Special Characters

In XML, an entity is an XML structure (or plain text) that has a name. Referencing the entity by name causes it to be inserted into the document in place of the entity reference. To create an entity reference, the entity name is surrounded by an ampersand and a semicolon, like this:

  &entityName; 

Later, when you learn how to write a DTD, you'll see that you can define your own entities, so that &yourEntityName; expands to all the text you defined for that entity. For now, though, we'll focus on the predefined entities and character references that don't require any special definitions.

Predefined Entities

An entity reference like &amp; contains a name (in this case, "amp") between the start and end delimiters. The text it refers to (&) is substituted for the name, like a macro in a programming language. Table 7-1 shows the predefined entities for special characters.

Table 7-1 Predefined Entities
Character
Reference
&
&amp;
<
&lt;
>
&gt;
"
&quot;
'
&apos;

Character References

A character reference like &#147; contains a hash mark (#) followed by a number. The number is the Unicode value for a single character, such as 65 for the letter "A", 147 for the left-curly quote, or 148 for the right-curly quote. In this case, the "name" of the entity is the hash mark followed by the digits that identify the character.


Note: XML expects values to be specified in decimal. However, the Unicode charts at http://www.unicode.org/charts/ specify values in hexadecimal! So you'll need to do a conversion to get the right value to insert into your XML data set.


Using an Entity Reference in an XML Document

Suppose you wanted to insert a line like this in your XML document:

 Market Size < predicted 

The problem with putting that line into an XML file directly is that when the parser sees the left-angle bracket (<), it starts looking for a tag name, which throws off the parse. To get around that problem, you put &lt; in the file, instead of "<".


Note: The results of the modifications below are contained in slideSample03.xml.


Add the text highlighted below to your slideSample.xml file, and save a copy of it for future use as slideSample03.xml:

  <!-- OVERVIEW -->
  <slide type="all">
    <title>Overview</title>
    ...
  </slide> 
  <slide type="exec">
    <title>Financial Forecast</title>
    <item>Market Size &lt; predicted</item>
    <item>Anticipated Penetration</item>
    <item>Expected Revenues</item>
    <item>Profit Margin </item>
  </slide> 
</slideshow> 

When you use an XML parser to echo this data, you will see the desired output:

Market Size < predicted 

You see an angle bracket ("<") where you coded "&lt;", because the XML parser converts the reference into the entity it represents, and passes that entity to the application.

Handling Text with XML-Style Syntax

When you are handling large blocks of XML or HTML that include many of the special characters, it would be inconvenient to replace each of them with the appropriate entity reference. For those situations, you can use a CDATA section.


Note: The results of the modifications below are contained in slideSample04.xml.


A CDATA section works like <pre>...</pre> in HTML, only more so--all whitespace in a CDATA section is significant, and characters in it are not interpreted as XML. A CDATA section starts with <![CDATA[ and ends with ]]>.

Add the text highlighted below to your slideSample.xml file to define a CDATA section for a fictitious technical slide, and save a copy of the file as slideSample04.xml:

   ...
  <slide type="tech">
    <title>How it Works</title>
    <item>First we fozzle the frobmorten</item>
    <item>Then we framboze the staten</item>
    <item>Finally, we frenzle the fuznaten</item>
    <item><![CDATA[Diagram:
      frobmorten <--------------- fuznaten
        |    <3>  ^
        | <1>    |  <1> = fozzle
        V     |  <2> = framboze 
        Staten--------------------+      <3> = frenzle
           <2>
    ]]></item>
  </slide>
</slideshow> 

When you echo this file with an XML parser, you'll see the following output:

Diagram:
frobmorten <--------------fuznaten
     |          <3>          ^
     | <1>                  |   <1> = fozzle
    V                  |   <2> = framboze 
  staten----------------------+   <3> = frenzle
           <2> 

The point here is that the text in the CDATA section will have arrived as it was written. Since the parser doesn't treat the angle brackets as XML, they don't generate the fatal errors they would otherwise cause. (Because, if the angle brackets weren't in a CDATA section, the document would not be well-formed.)

Creating a Document Type Definition (DTD)

After the XML declaration, the document prolog can include a DTD, which lets you specify the kinds of tags that can be included in your XML document. In addition to telling a validating parser which tags are valid, and in what arrangements, a DTD tells both validating and nonvalidating parsers where text is expected, which lets the parser determine whether the whitespace it sees is significant or ignorable.

Basic DTD Definitions

To begin learning about DTD definitions, let's start by telling the parser where text is expected and where any text (other than whitespace) would be an error. (Whitespace in such locations is ignorable.)


Note: The DTD defined in this section is contained in slideshow1a.dtd. (The browsable version is slideshow1a-dtd.html.)


Start by creating a file named slideshow.dtd. Enter an XML declaration and a comment to identify the file, as shown below:

<?xml version='1.0' encoding='utf-8'?> 
<!-- 
  DTD for a simple "slide show". 
--> 

Next, add the text highlighted below to specify that a slideshow element contains slide elements and nothing else:

<!-- DTD for a simple "slide show". --> 
<!ELEMENT slideshow (slide+)> 

As you can see, the DTD tag starts with <! followed by the tag name (ELEMENT). After the tag name comes the name of the element that is being defined (slideshow) and, in parentheses, one or more items that indicate the valid contents for that element. In this case, the notation says that a slideshow consists of one or more slide elements.

Without the plus sign, the definition would be saying that a slideshow consists of a single slide element. The qualifiers you can add to an element definition are listed in Table 7-2.

Table 7-2 DTD Element Qualifiers 
Qualifier
Name
Meaning
?
Question Mark
Optional (zero or one)
*
Asterisk
Zero or more
+
Plus Sign
One or more

You can include multiple elements inside the parentheses in a comma separated list, and use a qualifier on each element to indicate how many instances of that element may occur. The comma-separated list tells which elements are valid and the order they can occur in.

You can also nest parentheses to group multiple items. For an example, after defining an image element (coming up shortly), you could declare that every image element must be paired with a title element in a slide by specifying ((image, title)+). Here, the plus sign applies to the image/title pair to indicate that one or more pairs of the specified items can occur.

Defining Text and Nested Elements

Now that you have told the parser something about where not to expect text, let's see how to tell it where text can occur. Add the text highlighted below to define the slide, title, item, and list elements:

<!ELEMENT slideshow (slide+)>
<!ELEMENT slide (title, item*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT item (#PCDATA | item)* > 

The first line you added says that a slide consists of a title followed by zero or more item elements. Nothing new there. The next line says that a title consists entirely of parsed character data (PCDATA). That's known as "text" in most parts of the country, but in XML-speak it's called "parsed character data". (That distinguishes it from CDATA sections, which contain character data that is not parsed.) The "#" that precedes PCDATA indicates that what follows is a special word, rather than an element name.

The last line introduces the vertical bar (|), which indicates an or condition. In this case, either PCDATA or an item can occur. The asterisk at the end says that either one can occur zero or more times in succession. The result of this specification is known as a mixed-content model, because any number of item elements can be interspersed with the text. Such models must always be defined with #PCDATA specified first, some number of alternate items divided by vertical bars (|), and an asterisk (*) at the end.

Save a copy of this DTD as slideSample1a.dtd, for use when experimenting with basic DTD processing.

Limitations of DTDs

It would be nice if we could specify that an item contains either text, or text followed by one or more list items. But that kind of specification turns out to be hard to achieve in a DTD. For example, you might be tempted to define an item like this:

<!ELEMENT item (#PCDATA | (#PCDATA, item+)) > 

That would certainly be accurate, but as soon as the parser sees #PCDATA and the vertical bar, it requires the remaining definition to conform to the mixed-content model. This specification doesn't, so you get can error that says: Illegal mixed content model for 'item'. Found &#x28; ..., where the hex character 28 is the angle bracket the ends the definition.

Trying to double-define the item element doesn't work, either. A specification like this:

<!ELEMENT item (#PCDATA) >
<!ELEMENT item (#PCDATA, item+) > 

produces a "duplicate definition" warning when the validating parser runs. The second definition is, in fact, ignored. So it seems that defining a mixed content model (which allows item elements to be interspersed in text) is about as good as we can do.

In addition to the limitations of the mixed content model mentioned above, there is no way to further qualify the kind of text that can occur where PCDATA has been specified. Should it contain only numbers? Should be in a date format, or possibly a monetary format? There is no way to say in the context of a DTD.

Finally, note that the DTD offers no sense of hierarchy. The definition for the title element applies equally to a slide title and to an item title. When we expand the DTD to allow HTML-style markup in addition to plain text, it would make sense to restrict the size of an item title compared to a slide title, for example. But the only way to do that would be to give one of them a different name, such as "item-title". The bottom line is that the lack of hierarchy in the DTD forces you to introduce a "hyphenation hierarchy" (or its equivalent) in your namespace. All of these limitations are fundamental motivations behind the development of schema-specification standards.

Special Element Values in the DTD

Rather than specifying a parenthesized list of elements, the element definition could use one of two special values: ANY or EMPTY. The ANY specification says that the element may contain any other defined element, or PCDATA. Such a specification is usually used for the root element of a general-purpose XML document such as you might create with a word processor. Textual elements could occur in any order in such a document, so specifying ANY makes sense.

The EMPTY specification says that the element contains no contents. So the DTD for e-mail messages that let you "flag" the message with <flag/> might have a line like this in the DTD:

<!ELEMENT flag EMPTY> 

Referencing the DTD

In this case, the DTD definition is in a separate file from the XML document. That means you have to reference it from the XML document, which makes the DTD file part of the external subset of the full Document Type Definition (DTD) for the XML file. As you'll see later on, you can also include parts of the DTD within the document. Such definitions constitute the local subset of the DTD.


Note: The XML written in this section is contained in slideSample05.xml. (The browsable version is slideSample05-xml.html.)


To reference the DTD file you just created, add the line highlighted below to your slideSample.xml file, and save a copy of the file as slideSample05.xml:

<!--  A SAMPLE set of slides  --> 
<!DOCTYPE slideshow SYSTEM "slideshow.dtd"> 
<slideshow 

Again, the DTD tag starts with "<!". In this case, the tag name, DOCTYPE, says that the document is a slideshow, which means that the document consists of the slideshow element and everything within it:

<slideshow>
...
</slideshow> 

This tag defines the slideshow element as the root element for the document. An XML document must have exactly one root element. This is where that element is specified. In other words, this tag identifies the document content as a slideshow.

The DOCTYPE tag occurs after the XML declaration and before the root element. The SYSTEM identifier specifies the location of the DTD file. Since it does not start with a prefix like http:/ or file:/, the path is relative to the location of the XML document. Remember the setDocumentLocator method? The parser is using that information to find the DTD file, just as your application would to find a file relative to the XML document. A PUBLIC identifier could also be used to specify the DTD file using a unique name--but the parser would have to be able to resolve it

The DOCTYPE specification could also contain DTD definitions within the XML document, rather than referring to an external DTD file. Such definitions would be contained in square brackets, like this:

<!DOCTYPE slideshow SYSTEM "slideshow1.dtd" [
  ...local subset definitions here...
]> 

You'll take advantage of that facility in a moment to define some entities that can be used in the document.

Documents and Data

Earlier, you learned that one reason you hear about XML documents, on the one hand, and XML data, on the other, is that XML handles both comfortably, depending on whether text is or is not allowed between elements in the structure.

In the sample file you have been working with, the slideshow element is an example of a data element--it contains only subelements with no intervening text. The item element, on the other hand, might be termed a document element, because it is defined to include both text and subelements.

As you work through this tutorial, you will see how to expand the definition of the title element to include HTML-style markup, which will turn it into a document element as well.

Defining Attributes and Entities in the DTD

The DTD you've defined so far is fine for use with the nonvalidating parser. It tells where text is expected and where it isn't, which is all the nonvalidating parser is going to pay attention to. But for use with the validating parser, the DTD needs to specify the valid attributes for the different elements. You'll do that in this section, after which you'll define one internal entity and one external entity that you can reference in your XML file.

Defining Attributes in the DTD

Let's start by defining the attributes for the elements in the slide presentation.


Note: The XML written in this section is contained in slideshow1b.dtd. (The browsable version is slideshow1b-dtd.html.)


Add the text highlighted below to define the attributes for the slideshow element:

<!ELEMENT slideshow (slide+)>
<!ATTLIST slideshow 
    title    CDATA    #REQUIRED
    date     CDATA    #IMPLIED
    author   CDATA    "unknown"
>
<!ELEMENT slide (title, item*)> 

The DTD tag ATTLIST begins the series of attribute definitions. The name that follows ATTLIST specifies the element for which the attributes are being defined. In this case, the element is the slideshow element. (Note once again the lack of hierarchy in DTD specifications.)

Each attribute is defined by a series of three space-separated values. Commas and other separators are not allowed, so formatting the definitions as shown above is helpful for readability. The first element in each line is the name of the attribute: title, date, or author, in this case. The second element indicates the type of the data: CDATA is character data--unparsed data, once again, in which a left-angle bracket (<) will never be construed as part of an XML tag. Table 7-3 presents the valid choices for the attribute type.

Table 7-3 Attribute Types
Attribute Type
Specifies...
(value1 | value2 | ...)
A list of values separated by vertical bars. (Example below)
CDATA
"Unparsed character data". (For normal people, a text string.)
ID
A name that no other ID attribute shares.
IDREF
A reference to an ID defined elsewhere in the document.
IDREFS
A space-separated list containing one or more ID references.
ENTITY
The name of an entity defined in the DTD.
ENTITIES
A space-separated list of entities.
NMTOKEN
A valid XML name composed of letters, numbers, hyphens, underscores, and colons.
NMTOKENS
A space-separated list of names.
NOTATION
The name of a DTD-specified notation, which describes a non-XML data format, such as those used for image files.*

*This is a rapidly obsolescing specification which will be discussed in greater length towards the end of this section.

When the attribute type consists of a parenthesized list of choices separated by vertical bars, the attribute must use one of the specified values. For an example, add the text highlighted below to the DTD:

<!ELEMENT slide (title, item*)>
<!ATTLIST slide 
    type   (tech | exec | all) #IMPLIED
>
<!ELEMENT title (#PCDATA)>
<!ELEMENT item (#PCDATA | item)* > 

This specification says that the slide element's type attribute must be given as type="tech", type="exec", or type="all". No other values are acceptable. (DTD-aware XML editors can use such specifications to present a pop-up list of choices.)

The last entry in the attribute specification determines the attributes default value, if any, and tells whether or not the attribute is required. Table 7-4 shows the possible choices.

Table 7-4 Attribute-Specification Parameters
Specification
Specifies...
#REQUIRED
The attribute value must be specified in the document.
#IMPLIED
The value need not be specified in the document. If it isn't, the application will have a default value it uses.
"defaultValue"
The default value to use, if a value is not specified in the document.
#FIXED "fixedValue"
The value to use. If the document specifies any value at all, it must be the same.

Finally, save a copy of the DTD as slideshow1b.dtd, for use when experimenting with attribute definitions.

Defining Entities in the DTD

So far, you've seen predefined entities like &amp; and you've seen that an attribute can reference an entity. It's time now for you to learn how to define entities of your own.


Note: The XML you'll create here is contained in slideSample06.xml. (The browsable version is slideSample06-xml.html.)


Add the text highlighted below to the DOCTYPE tag in your XML file:

<!DOCTYPE slideshow SYSTEM "slideshow.dtd" [
  <!ENTITY product  "WonderWidget">
  <!ENTITY products "WonderWidgets">
]> 

The ENTITY tag name says that you are defining an entity. Next comes the name of the entity and its definition. In this case, you are defining an entity named "product" that will take the place of the product name. Later when the product name changes (as it most certainly will), you will only have to change the name one place, and all your slides will reflect the new value.

The last part is the substitution string that replaces the entity name whenever it is referenced in the XML document. The substitution string is defined in quotes, which are not included when the text is inserted into the document.

Just for good measure, we defined two versions, one singular and one plural, so that when the marketing mavens come up with "Wally" for a product name, you will be prepared to enter the plural as "Wallies" and have it substituted correctly.


Note: Truth be told, this is the kind of thing that really belongs in an external DTD. That way, all your documents can reference the new name when it changes. But, hey, this is an example...


Now that you have the entities defined, the next step is to reference them in the slide show. Make the changes highlighted below to do that:

<slideshow 
  title="WonderWidget&product; Slide Show" 
  ... 
  <!-- TITLE SLIDE -->
  <slide type="all">
    <title>Wake up to WonderWidgets&products;!</title>
  </slide> 
   <!-- OVERVIEW -->
  <slide type="all">
    <title>Overview</title>
    <item>Why <em>WonderWidgets&products;</em> are 
great</item>
    <item/>
    <item>Who <em>buys</em> WonderWidgets&products;</item>
  </slide> 

The points to notice here are that entities you define are referenced with the same syntax (&entityName;) that you use for predefined entities, and that the entity can be referenced in an attribute value as well as in an element's contents.

When you echo this version of the file with an XML parser, here is the kind of thing you'll see:

Wake up to WonderWidgets! 

Note that the product name has been substituted for the entity reference.

To finish, save a copy of the file as slideSample06.xml.

Additional Useful Entities

Here are several other examples for entity definitions that you might find useful when you write an XML document:

<!ENTITY ldquo  "&#147;"> <!-- Left Double Quote --> 
<!ENTITY rdquo  "&#148;"> <!-- Right Double Quote -->
<!ENTITY trade  "&#153;"> <!-- Trademark Symbol (TM) -->
<!ENTITY rtrade "&#174;"> <!-- Registered Trademark (R) -->
<!ENTITY copyr  "&#169;"> <!-- Copyright Symbol -->  

Referencing External Entities

You can also use the SYSTEM or PUBLIC identifier to name an entity that is defined in an external file. You'll do that now.


Note: The XML defined here is contained in slideSample07.xml and in copyright.xml. (The browsable versions are slideSample07-xml.html and copyright-xml.html.)


To reference an external entity, add the text highlighted below to the DOCTYPE statement in your XML file:

<!DOCTYPE slideshow SYSTEM "slideshow.dtd" [
  <!ENTITY product  "WonderWidget">
  <!ENTITY products "WonderWidgets">
  <!ENTITY copyright SYSTEM "copyright.xml">
]> 

This definition references a copyright message contained in a file named copyright.xml. Create that file and put some interesting text in it, perhaps something like this:

  <!--  A SAMPLE copyright  --> 
This is the standard copyright message that our lawyers
make us put everywhere so we don't have to shell out a
million bucks every time someone spills hot coffee in their
lap... 

Finally, add the text highlighted below to your slideSample.xml file to reference the external entity, and save a copy of the file as slideSample07.html:

<!-- TITLE SLIDE -->
  ...
</slide> 
<!-- COPYRIGHT SLIDE -->
<slide type="all">
  <item>&copyright;</item>
</slide> 

You could also use an external entity declaration to access a servlet that produces the current date using a definition something like this:

<!ENTITY currentDate SYSTEM
  "http://www.example.com/servlet/CurrentDate?fmt=dd-MMM-
yyyy">  

You would then reference that entity the same as any other entity:

  Today's date is &currentDate;. 

When you echo the latest version of the slide presentation with an XML parser, here is what you'll see:

...
<slide type="all">
  <item>
This is the standard copyright message that our lawyers
make us put everywhere so we don't have to shell out a
million bucks every time someone spills hot coffee in their
lap...
  </item>
</slide>
... 

You'll notice that the newline which follows the comment in the file is echoed as a character, but that the comment itself is ignored. That is the reason that the copyright message appears to start on the next line after the <item> element, instead of on the same line--the first character echoed is actually the newline that follows the comment.

Summarizing Entities

An entity that is referenced in the document content, whether internal or external, is termed a general entity. An entity that contains DTD specifications that are referenced from within the DTD is termed a parameter entity. (More on that later.)

An entity which contains XML (text and markup), and which is therefore parsed, is known as a parsed entity. An entity which contains binary data (like images) is known as an unparsed entity. (By its very nature, it must be external.) We'll be discussing references to unparsed entities in the next section of this tutorial.

Referencing Binary Entities

This section discusses the options for referencing binary files like image files and multimedia data files.

Using a MIME Data Type

There are two ways to go about referencing an unparsed entity like a binary image file. One is to use the DTD's NOTATION-specification mechanism. However, that mechanism is a complex, non-intuitive holdover that mostly exists for compatibility with SGML documents. We will have occasion to discuss it in a bit more depth when we look at the DTDHandler API, but suffice it for now to say that the combination of the recently defined XML namespaces standard, in conjunction with the MIME data types defined for electronic messaging attachments, together provide a much more useful, understandable, and extensible mechanism for referencing unparsed external entities.


Note: The XML described here is in slideshow1b.dtd. It shows how binary references can be made, assuming that the application which will be processing the XML data knows how to handle such references.


To set up the slideshow to use image files, add the text highlighted below to your slideshow1b.dtd file:

<!ELEMENT slide (image?, title, item*)>
<!ATTLIST slide 
    type   (tech | exec | all) #IMPLIED
>
<!ELEMENT title (#PCDATA)>
<!ELEMENT item (#PCDATA | item)* >
<!ELEMENT image EMPTY>
<!ATTLIST image 
    alt    CDATA    #IMPLIED
    src    CDATA    #REQUIRED
    type   CDATA    "image/gif"
> 

These modifications declare image as an optional element in a slide, define it as empty element, and define the attributes it requires. The image tag is patterned after the HTML 4.0 tag, img, with the addition of an image-type specifier, type. (The img tag is defined in the HTML 4.0 Specification.)

The image tag's attributes are defined by the ATTLIST entry. The alt attribute, which defines alternate text to display in case the image can't be found, accepts character data (CDATA). It has an "implied" value, which means that it is optional, and that the program processing the data knows enough to substitute something like "Image not found". On the other hand, the src attribute, which names the image to display, is required.

The type attribute is intended for the specification of a MIME data type, as defined at http://www.iana.org/assignments/media-types/. It has a default value: image/gif.


Note: It is understood here that the character data (CDATA) used for the type attribute will be one of the MIME data types. The two most common formats are: image/gif, and image/jpeg. Given that fact, it might be nice to specify an attribute list here, using something like:

type ("image/gif", "image/jpeg")

That won't work, however, because attribute lists are restricted to name tokens. The forward slash isn't part of the valid set of name-token characters, so this declaration fails. Besides that, creating an attribute list in the DTD would limit the valid MIME types to those defined today. Leaving it as CDATA leaves things more open ended, so that the declaration will continue to be valid as additional types are defined.


In the document, a reference to an image named "intro-pic" might look something like this:

<image src="image/intro-pic.gif", alt="Intro Pic", 
type="image/gif" /> 

The Alternative: Using Entity References

Using a MIME data type as an attribute of an element is a mechanism that is flexible and expandable. To create an external ENTITY reference using the notation mechanism, you need DTD NOTATION elements for JPEG and GIF data. Those can of course be obtained from some central repository. But then you need to define a different ENTITY element for each image you intend to reference! In other words, adding a new image to your document always requires both a new entity definition in the DTD and a reference to it in the document. Given the anticipated ubiquity of the HTML 4.0 specification, the newer standard is to use the MIME data types and a declaration like image, which assumes the application knows how to process such elements.

Defining Parameter Entities and Conditional Sections

Just as a general entity lets you reuse XML data in multiple places, a parameter entity lets you reuse parts of a DTD in multiple places. In this section of the tutorial, you'll see how to define and use parameter entities. You'll also see how to use parameter entities with conditional sections in a DTD.

Creating and Referencing a Parameter Entity

Recall that the existing version of the slide presentation could not be validated because the document used <em> tags, and those are not part of the DTD. In general, we'd like to use a whole variety of HTML-style tags in the text of a slide, not just one or two, so it makes more sense to use an existing DTD for XHTML than it does to define all the tags we might ever need. A parameter entity is intended for exactly that kind of purpose.


Note: The DTD specifications shown here are contained in slideshow2.dtd and xhtml.dtd. The XML file that references it is slideSample08.xml. (The browsable versions are slideshow2-dtd.html and slideSample08-xml.html.)


Open your DTD file for the slide presentation and add the text highlighted below to define a parameter entity that references an external DTD file:

<!ELEMENT slide (image?, title?, item*)>
<!ATTLIST slide 
      ...
> 
<!ENTITY % xhtml SYSTEM "xhtml.dtd">
%xhtml; 
<!ELEMENT title ... 

Here, you used an <!ENTITY> tag to define a parameter entity, just as for a general entity, but using a somewhat different syntax. You included a percent sign (%) before the entity name when you defined the entity, and you used the percent sign instead of an ampersand when you referenced it.

Also, note that there are always two steps for using a parameter entity. The first is to define the entity name. The second is to reference the entity name, which actually does the work of including the external definitions in the current DTD. Since the URI for an external entity could contain slashes (/) or other characters that are not valid in an XML name, the definition step allows a valid XML name to be associated with an actual document. (This same technique is used in the definition of namespaces, and anywhere else that XML constructs need to reference external documents.)

Notes:

The whole point of using an XHTML-based DTD was to gain access to an entity it defines that covers HTML-style tags like <em> and <b>. Looking through xhtml.dtd reveals the following entity, which does exactly what we want:

  <!ENTITY % inline "#PCDATA|em|b|a|img|br">  

This entity is a simpler version of those defined in the Modularized XHTML draft. It defines the HTML-style tags we are most likely to want to use -- emphasis, bold, and break, plus a couple of others for images and anchors that we may or may not use in a slide presentation. To use the inline entity, make the changes highlighted below in your DTD file:

<!ELEMENT title (#PCDATA %inline;)*>
<!ELEMENT item (#PCDATA %inline; | item)* > 

These changes replaced the simple #PCDATA item with the inline entity. It is important to notice that #PCDATA is first in the inline entity, and that inline is first wherever we use it. That is required by XML's definition of a mixed-content model. To be in accord with that model, you also had to add an asterisk at the end of the title definition.

Save the DTD as slideshow2.dtd, for use when experimenting with parameter entities.


Note: The Modularized XHTML DTD defines both inline and Inline entities, and does so somewhat differently. Rather than specifying #PCDATA|em|b|a|img|Br, their definitions are more like (#PCDATA|em|b|a|img|Br)*. Using one of those definitions, therefore, looks more like this:

<!ELEMENT title %Inline; >


Conditional Sections

Before we proceed with the next programming exercise, it is worth mentioning the use of parameter entities to control conditional sections. Although you cannot conditionalize the content of an XML document, you can define conditional sections in a DTD that become part of the DTD only if you specify include. If you specify ignore, on the other hand, then the conditional section is not included.

Suppose, for example, that you wanted to use slightly different versions of a DTD, depending on whether you were treating the document as an XML document or as a SGML document. You could do that with DTD definitions like the following:

someExternal.dtd: 
  <![ INCLUDE [
    ... XML-only definitions
  ]]>
  <![ IGNORE [
    ... SGML-only definitions
  ]]>
  ... common definitions  

The conditional sections are introduced by "<![", followed by the INCLUDE or IGNORE keyword and another "[". After that comes the contents of the conditional section, followed by the terminator: "]]>". In this case, the XML definitions are included, and the SGML definitions are excluded. That's fine for XML documents, but you can't use the DTD for SGML documents. You could change the keywords, of course, but that only reverses the problem.

The solution is to use references to parameter entities in place of the INCLUDE and IGNORE keywords:

someExternal.dtd: 
  <![ %XML; [
    ... XML-only definitions
  ]]>
  <![ %SGML; [
    ... SGML-only definitions
  ]]>
  ... common definitions  

Then each document that uses the DTD can set up the appropriate entity definitions:

<!DOCTYPE foo SYSTEM "someExternal.dtd" [
  <!ENTITY % XML  "INCLUDE" >
  <!ENTITY % SGML "IGNORE" >
]>
<foo>
  ...
</foo>  

This procedure puts each document in control of the DTD. It also replaces the INCLUDE and IGNORE keywords with variable names that more accurately reflect the purpose of the conditional section, producing a more readable, self-documenting version of the DTD.

Resolving A Naming Conflict

The XML structures you have created thus far have actually encountered a small naming conflict. It seems that xhtml.dtd defines a title element which is entirely different from the title element defined in the slideshow DTD. Because there is no hierarchy in the DTD, these two definitions conflict.


Note: The Modularized XHTML DTD also defines a title element that is intended to be the document title, so we can't avoid the conflict by changing xhtml.dtd--the problem would only come back to haunt us later.


You could use XML namespaces to resolve the conflict. You'll take a look at that approach in the next section. Alternatively, you could use one of the more hierarchical schema proposals described in Schema Standards. The simplest way to solve the problem for now, though, is simply to rename the title element in slideshow.dtd.


Note: The XML shown here is contained in slideshow3.dtd and slideSample09.xml, which references copyright.xml and xhtml.dtd. (The browsable versions are slideshow3-dtd.html, slideSample09-xml.html, copyright-xml.html, and xhtml-dtd.html.)


To keep the two title elements separate, you'll create a "hyphenation hierarchy". Make the changes highlighted below to change the name of the title element in slideshow.dtd to slide-title:

<!ELEMENT slide (image?, slide-title?, item*)>
<!ATTLIST slide 
      type   (tech | exec | all) #IMPLIED
> 
<!-- Defines the %inline; declaration -->
<!ENTITY % xhtml SYSTEM "xhtml.dtd">
%xhtml; 
<!ELEMENT slide-title (%inline;)*> 

Save this DTD as slideshow3.dtd.

The next step is to modify the XML file to use the new element name. To do that, make the changes highlighted below:

...
<slide type="all">
<slide-title>Wake up to ... </slide-title>
</slide> 
... 
<!-- OVERVIEW -->
<slide type="all">
<slide-title>Overview</slide-title>
<item>... 

Save a copy of this file as slideSample09.xml.

Using Namespaces

As you saw earlier, one way or another it is necessary to resolve the conflict between the title element defined in slideshow.dtd and the one defined in xhtml.dtd when the same name is used for different purposes. In the previous exercise, you hyphenated the name in order to put it into a different "namespace". In this section, you'll see how to use the XML namespace standard to do the same thing without renaming the element.

The primary goal of the namespace specification is to let the document author tell the parser which DTD or schema to use when parsing a given element. The parser can then consult the appropriate DTD or schema for an element definition. Of course, it is also important to keep the parser from aborting when a "duplicate" definition is found, and yet still generate an error if the document references an element like title without qualifying it (identifying the DTD or schema to use for the definition).


Note: Namespaces apply to attributes as well as to elements. In this section, we consider only elements. For more information on attributes, consult the namespace specification at http://www.w3.org/TR/REC-xml-names/.


Defining a Namespace in a DTD

In a DTD, you define a namespace that an element belongs to by adding an attribute to the element's definition, where the attribute name is xmlns ("xml namespace"). For example, you could do that in slideshow.dtd by adding an entry like the following in the title element's attribute-list definition:

<!ELEMENT title (%inline;)*>
<!ATTLIST title 
  xmlns CDATA #FIXED "http://www.example.com/slideshow"
> 

Declaring the attribute as FIXED has several important features:

To be thorough, every element name in your DTD would get the exact same attribute, with the same value. (Here, though, we're only concerned about the title element.) Note, too, that you are using a CDATA string to supply the URI. In this case, we've specified an URL. But you could also specify a URN, possibly by specifying a prefix like urn: instead of http:. (URNs are currently being researched. They're not seeing a lot of action at the moment, but that could change in the future.)

Referencing a Namespace

When a document uses an element name that exists in only one of the.DTDs or schemas it references, the name does not need to be qualified. But when an element name that has multiple definitions is used, some sort of qualification is a necessity.


Note: In point of fact, an element name is always qualified by it's default namespace, as defined by name of the DTD file it resides in. As long as there as is only one definition for the name, the qualification is implicit.


You qualify a reference to an element name by specifying the xmlns attribute, as shown here:

<title xmlns="http://www.example.com/slideshow">
  Overview
</title> 

The specified namespace applies to that element, and to any elements contained within it.

Defining a Namespace Prefix

When you only need one namespace reference, it's not such a big deal. But when you need to make the same reference several times, adding xmlns attributes becomes unwieldy. It also makes it harder to change the name of the namespace at a later date.

The alternative is to define a namespace prefix, which as simple as specifying xmlns, a colon (:) and the prefix name before the attribute value, as shown here:

<SL:slideshow xmlns:SL='http:/www.example.com/slideshow'
    ...>
  ...
</SL:slideshow> 

This definition sets up SL as a prefix that can be used to qualify the current element name and any element within it. Since the prefix can be used on any of the contained elements, it makes the most sense to define it on the XML document's root element, as shown here.


Note: The namespace URI can contain characters which are not valid in an XML name, so it cannot be used as a prefix directly. The prefix definition associates an XML name with the URI, which allows the prefix name to be used instead. It also makes it easier to change references to the URI in the future.


When the prefix is used to qualify an element name, the end-tag also includes the prefix, as highlighted here:

<SL:slideshow xmlns:SL='http:/www.example.com/slideshow'
      ...>
  ...
  <slide>
    <SL:title>Overview</SL:title>
  </slide>
  ...
</SL:slideshow> 

Finally, note that multiple prefixes can be defined in the same element, as shown here:

<SL:slideshow xmlns:SL='http:/www.example.com/slideshow'
      xmlns:xhtml='urn:...'>
  ... 
</SL:slideshow> 

With this kind of arrangement, all of the prefix definitions are together in one place, and you can use them anywhere they are needed in the document. This example also suggests the use of URN to define the xhtml prefix, instead of an URL. That definition would conceivably allow the application to reference a local copy of the XHTML DTD or some mirrored version, with a potentially beneficial impact on performance.

Divider
FAQ
History
PreviousHomeNext Search
Feedback
Divider

All of the material in The J2EE Tutorial for the Sun ONE Platform is copyright-protected and may not be published in other works without express written permission from Sun Microsystems.