Understanding the SAX Parser

The SAX parser is one of the two main parsers used for XML data. It is an event-based parser, whereas the other XML parser, DOM, is a tree-based parser. Xerces, from the Apache Software Foundation, provides both XML parsers. The Xerces code is written in C++. To make XML parsing available to business functions, a C API, XercesWrapper, provides access to both parsers. The two parsers are designed quite differently, so each has advantages depending on the intended usage.

The DOM parser reads the XML file and builds an internal model (a DOM document tree) of that file in memory. This has the advantage of enabling you to traverse the tree, retrieve parent-child relationships, and revisit the same data multiple times. The disadvantages are high memory requirements for large XML files and the fact that the entire XML file must be read into memory before any of its data can be processed. The DOM parser can also be used to build a DOM document tree in memory programmatically and then write that tree to a file in XML format.
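As a rough illustration of the tree-based model, the following sketch builds a small document tree in memory, visits it more than once, and serializes it to XML. It uses Python's standard xml.dom.minidom module as a stand-in, not the XercesWrapper API, and the element names (orders, order) are made up for the example.

```python
from xml.dom import minidom

# Build a small DOM document tree in memory (element names are illustrative).
doc = minidom.Document()
root = doc.createElement("orders")
doc.appendChild(root)

for order_id, amount in (("1001", "25.00"), ("1002", "40.50")):
    order = doc.createElement("order")
    order.setAttribute("id", order_id)
    order.appendChild(doc.createTextNode(amount))
    root.appendChild(order)

# The tree stays in memory, so the same data can be visited repeatedly
# and parent-child relationships can be followed in either direction.
orders = root.getElementsByTagName("order")
first_pass = [o.getAttribute("id") for o in orders]
second_pass = [o.getAttribute("id") for o in orders]
parent_name = orders[0].parentNode.tagName

# Serialize the tree to XML text; writing this string to a file
# produces the XML document.
xml_text = doc.toxml()
```

Because the whole tree is held in memory, the second pass costs no additional parsing, which is the trade-off against memory use described above.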

The SAX parser reads an XML file and, as each item is read, passes that piece of data to callback functions. This approach enables fast processing with minimal memory usage, and parsing can be stopped as soon as a specific item has been found. The disadvantages are that the callback functions must maintain the current state of parsing themselves, and that previous data items cannot be revisited without rereading the XML file. Finally, the SAX parser is a read-only parser.
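The event-based model, including stopping early, can be sketched as follows. This uses Python's standard xml.sax module rather than the XercesWrapper API. The StopParsing exception, the FindFirstPrice handler, and the price element are hypothetical names chosen for the example; raising an exception from a callback is one common way to halt an xml.sax parse, and is an assumption here rather than how XercesWrapper does it.

```python
import xml.sax


class StopParsing(Exception):
    """Raised by a callback to end parsing once the target is found."""


class FindFirstPrice(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_price = False  # parsing state the callbacks track themselves
        self.chunks = []
        self.price = None

    def startElement(self, name, attrs):
        if name == "price":
            self.in_price = True

    def characters(self, content):
        # Text may arrive in several pieces, so collect it.
        if self.in_price:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "price":
            self.price = "".join(self.chunks)
            raise StopParsing  # no further callbacks are made


XML = (b"<items><item><price>9.99</price></item>"
       b"<item><price>5.00</price></item></items>")


def find_first_price(data):
    handler = FindFirstPrice()
    try:
        xml.sax.parseString(data, handler)
    except StopParsing:
        pass  # expected: the handler found what it needed
    return handler.price
```

Calling find_first_price(XML) returns the text of the first price element only; the second item is never delivered to the callbacks, and getting it back later would require parsing the data again.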

This is a typical sequence used for parsing an XML data file using the DOM parser:

  1. Initialize the XercesWrapper, which, in turn, initializes the Xerces code.

  2. Initialize the DOM parser.

  3. Parse the XML data file.

  4. Retrieve a pointer to the root element of the DOM document tree.

  5. Retrieve additional elements and data, by traversing the DOM document tree.

  6. Free all DOM elements that have been retrieved.

  7. Free the DOM document tree.

  8. Free the DOM parser.

  9. Terminate the XercesWrapper interface, which, in turn, shuts down the Xerces code.
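The DOM sequence above can be sketched in outline as follows. This is a minimal illustration using Python's standard xml.dom.minidom module, not the XercesWrapper API, so the library initialization and termination steps have no counterpart here, and a single unlink() call stands in for the separate free steps. The read_titles function and the library, book, and title element names are hypothetical.

```python
from xml.dom import minidom

XML = ("<library><book><title>Dune</title></book>"
       "<book><title>Emma</title></book></library>")


def read_titles(xml_text):
    # Parse the whole document into an in-memory tree.
    doc = minidom.parseString(xml_text)
    try:
        # Retrieve the root element of the document tree.
        root = doc.documentElement
        # Retrieve further elements and data by traversing the tree.
        titles = []
        for book in root.getElementsByTagName("book"):
            title = book.getElementsByTagName("title")[0]
            titles.append(title.firstChild.data)
        return root.tagName, titles
    finally:
        # Release the tree once the data has been copied out of it.
        doc.unlink()
```

The try/finally block mirrors the cleanup steps at the end of the sequence: the tree is released even if traversal fails partway through.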

This is a typical sequence used for parsing an XML data file using the SAX parser:

  1. Initialize the XercesWrapper, which, in turn, initializes the Xerces code.

  2. Initialize the SAX parser.

  3. Set up various callback functions for specific parsing events.

  4. Parse the XML data file.

  5. As each event in the XML file is parsed, the parser calls the corresponding callback function.

  6. Within the callback functions, process the retrieved data and maintain a context for coordination between callback functions.

  7. Free the SAX parser.

  8. Terminate the XercesWrapper interface, which, in turn, shuts down the Xerces code.
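The SAX sequence above, including the shared context, can be sketched in outline as follows. The example uses Python's standard xml.sax module rather than the XercesWrapper API, so the initialize, free, and terminate steps have no explicit counterpart. OrderContext, OrderHandler, parse_orders, and the element names are hypothetical.

```python
import io
import xml.sax


class OrderContext:
    """Shared state that the callbacks use to coordinate with each other."""

    def __init__(self):
        self.path = []          # stack of currently open element names
        self.current_id = None  # id of the order being read
        self.text = []          # text collected for the current amount
        self.totals = {}        # order id -> amount


class OrderHandler(xml.sax.ContentHandler):
    def __init__(self, context):
        super().__init__()
        self.ctx = context

    def startElement(self, name, attrs):
        self.ctx.path.append(name)
        if name == "order":
            self.ctx.current_id = attrs.get("id")
        elif name == "amount":
            self.ctx.text = []

    def characters(self, content):
        if self.ctx.path and self.ctx.path[-1] == "amount":
            self.ctx.text.append(content)

    def endElement(self, name):
        if name == "amount":
            amount = float("".join(self.ctx.text))
            self.ctx.totals[self.ctx.current_id] = amount
        self.ctx.path.pop()


XML = (b'<orders><order id="A1"><amount>12.5</amount></order>'
       b'<order id="B2"><amount>7</amount></order></orders>')


def parse_orders(data):
    parser = xml.sax.make_parser()                   # create the SAX parser
    context = OrderContext()
    parser.setContentHandler(OrderHandler(context))  # register the callbacks
    parser.parse(io.BytesIO(data))                   # callbacks fire during parsing
    return context.totals
```

The context object is what lets endElement know which order an amount belongs to: startElement recorded the id earlier, and no callback can look back at the document itself.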