F Working with XPath Queries

This appendix provides detailed information about the support available within RUEI for the use of XPath queries.

XPath (XML Path Language) is a query language for selecting data from XML documents. It is based on a tree representation of the XML document, and selects nodes by a variety of criteria. In RUEI, XPath queries can be used for content scanning of XML documents. A complete specification of XPath is available at http://www.w3.org/TR/xpath. In popular use, an XPath expression is often referred to simply as an XPath.

RUEI supports the use of a limited set of XPath expressions to identify page names and Web services, and to perform page content and functional error checks. Optionally, you can extend the search to look for a literal string within the found element(s).

Note that XPath expressions are case sensitive.

Basic XPath Queries

Consider the following simple XML document that has a root element <a>, which has one child element <b>, which in turn has two child elements, <c> and <d>.

<?xml version="1.0" encoding="UTF-8"?>
<a>
  <b>
    <c>Hello world!</c>
    <d price="$56" />
  </b>
</a>

In XPath queries, the child-of relation is indicated with a / (slash) and element names are written without angle brackets (< and >). Hence, a/b means select <b> elements that are children of <a> elements. A / at the start of a query indicates that the first node in the path is the root element of the document. For example, the following query selects <c> elements that are children of a <b> element that is a child of the root element <a>:

/a/b/c

When used for content scanning, this would extract the text "Hello world!" from the above example document. As another example, the query /html/body/div/p would extract the contents of all paragraphs inside a <div> in the body of an XHTML document.
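As a rough illustration outside RUEI, the query /a/b/c can be emulated with Python's standard xml.etree.ElementTree module. This is only a sketch: ElementTree implements its own XPath subset, which is not the same as RUEI's, and it does not accept absolute paths on an element, so the check on the root element is done by hand. The helper name select_text is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0" encoding="UTF-8"?>
<a>
  <b>
    <c>Hello world!</c>
    <d price="$56" />
  </b>
</a>"""


def select_text(xml_text, path):
    """Return the text of every element matched by an absolute path such as /a/b/c."""
    steps = path.strip("/").split("/")
    root = ET.fromstring(xml_text)
    if root.tag != steps[0]:
        return []  # the first step must name the root element
    if len(steps) == 1:
        return [root.text or ""]
    # The remaining steps are evaluated relative to the root element.
    return [el.text or "" for el in root.findall("/".join(steps[1:]))]


print(select_text(DOC, "/a/b/c"))  # ['Hello world!']
print(select_text(DOC, "/b/c"))    # [] because <b> is not the root element
```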

Besides the contents of elements, there is one other type of data that can be extracted: XML attribute values. To query an attribute, refer to it as a "child" of the element to which it belongs. To distinguish attribute names from element names, they must be prefixed with a @ character. An @attribute node may only appear as the very last node in an XPath. For example, the following query extracts the text "$56" from the above example document:

/a/b/d/@price
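The attribute rule can be sketched in Python as follows. ElementTree paths cannot return attribute values directly, so this hypothetical helper (select_attribute) splits off the final @attribute step itself and rejects an @ step anywhere else. It illustrates the rule described above and is not RUEI's implementation.

```python
import xml.etree.ElementTree as ET

DOC = """<a><b><c>Hello world!</c><d price="$56" /></b></a>"""


def select_attribute(xml_text, path):
    """Evaluate a path whose last step is @attribute, e.g. /a/b/d/@price."""
    *steps, last = path.strip("/").split("/")
    if not last.startswith("@") or any(s.startswith("@") for s in steps):
        raise ValueError("@attribute must be the last step, and only the last")
    root = ET.fromstring(xml_text)
    if not steps or root.tag != steps[0]:
        return []
    # Elements that may carry the attribute: the root itself or its descendants.
    owners = [root] if len(steps) == 1 else root.findall("/".join(steps[1:]))
    name = last[1:]
    return [el.get(name) for el in owners if name in el.attrib]


print(select_attribute(DOC, "/a/b/d/@price"))  # ['$56']
```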

Restrictions

The XPath syntax supported by RUEI is a subset of the abbreviated XPath syntax. As a result, you may find that some syntax elements that work correctly in other XPath applications do not work in RUEI. For example, the following queries are not accepted:

//c          # error, // not supported
/a/*/b       # error, * not supported
/a/b/c/../b  # error, . and .. not supported

In addition, the following queries, although perfectly fine, will not extract anything from the above example document:

/a/c         # no <c> elements are children of the <a> element
/b/c         # <b> is not the root element
/a/b/e       # the document does not have <e> elements

Element and attribute names are case-sensitive. Hence, /a/b/c is not the same as /A/B/C.

In RUEI, all XPath queries must be absolute paths. That is, they must start at the root node, and each child element along the path must be named explicitly.
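Before entering a query in RUEI, it can be useful to check it against the restrictions described in this section. The following Python sketch does this with simple string checks. The function name check_ruei_path and its messages are hypothetical, and the checks cover only the rules stated above (absolute path, no //, *, . or .., and @attribute only as the last step).

```python
import re


def check_ruei_path(path):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if not path.startswith("/"):
        problems.append("path must be absolute (start with /)")
    if "//" in path:
        problems.append("// is not supported")
    # Drop predicates such as [2] or [@class="food"] before looking at step names.
    steps = re.sub(r"\[[^\]]*\]", "", path).strip("/").split("/")
    for step in steps:
        if step == "*":
            problems.append("* is not supported")
        elif step in (".", ".."):
            problems.append(". and .. are not supported")
    for step in steps[:-1]:
        if step.startswith("@"):
            problems.append("@attribute may only be the last step")
    return problems


for query in ("/a/b/c", "//c", "/a/*/b", "/a/b/c/../b", "a/b"):
    print(query, check_ruei_path(query))
```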

Indices and Attribute Predicates

Consider the slightly more complex XML document:

<?xml version="1.0" encoding="UTF-8"?>
<inventory>
  <item class="food">
    <name>Bread</name>
    <amount>12</amount>
  </item>
  <other>
    <msg>not available</msg>
  </other>
  <item class="cleaning">
    <name>Soap</name>
    <amount>33</amount>
  </item>
  <item class="food" type="perishable">
    <name>Milk</name>
    <amount>56</amount>
  </item>
</inventory>

The root element <inventory> has three <item> children, and an <other> child. By using an index [N] on a node in an XPath query, we can explicitly select the N-th <item> child element (counting starts at 1, not 0):

/inventory/item[2]/name  # extracts "Soap"

Note that when working with the above example document, there is no point in specifying an index on the <name> node. There are three <name> elements in the document, but each is a child of a different <item> element. Hence, each of them is the first <name> child of its parent.

/inventory/item/name[2]  # extracts nothing
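The effect of an index can be tried outside RUEI with Python's xml.etree.ElementTree, which also counts positions from 1. In this sketch the paths are written relative to the root element <inventory>, because ElementTree does not accept absolute paths on an element. The helper name names is hypothetical.

```python
import xml.etree.ElementTree as ET

INVENTORY = """<inventory>
  <item class="food"><name>Bread</name><amount>12</amount></item>
  <other><msg>not available</msg></other>
  <item class="cleaning"><name>Soap</name><amount>33</amount></item>
  <item class="food" type="perishable"><name>Milk</name><amount>56</amount></item>
</inventory>"""

root = ET.fromstring(INVENTORY)  # root is the <inventory> element


def names(relative_path):
    """Return the text of every element matched below <inventory>."""
    return [el.text for el in root.findall(relative_path)]


print(names("item[2]/name"))  # ['Soap']
print(names("item/name[2]"))  # [] because no <item> has a second <name>
print(names("item/name"))     # ['Bread', 'Soap', 'Milk']
```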

Attribute predicates are another way to specify more precisely which elements you want to select. They come in two forms: [@attr="value"] selects only elements that have the attr attribute set to value, and [@attr] selects only elements that have an attr attribute (set to any value).

/inventory/item[@class="cleaning"]/name  # extracts "Soap"
/inventory/item[@type]/name              # extracts "Milk"

The and keyword can be used to combine multiple attribute predicates within a single node. However, the XPath keyword or is not supported. In addition, instead of double quotes (") you can use single quotes (') to enclose the attribute value.

/inventory/item[@class='food' and @type]/name # extracts "Milk"
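The attribute predicates above can be reproduced with Python's xml.etree.ElementTree as a rough cross-check. Note the assumption made here: ElementTree does not accept the and keyword, so the combined condition is written as two chained predicates, which selects the same element in this document. The helper name names is hypothetical, and paths are relative to the root element <inventory>.

```python
import xml.etree.ElementTree as ET

INVENTORY = """<inventory>
  <item class="food"><name>Bread</name><amount>12</amount></item>
  <other><msg>not available</msg></other>
  <item class="cleaning"><name>Soap</name><amount>33</amount></item>
  <item class="food" type="perishable"><name>Milk</name><amount>56</amount></item>
</inventory>"""

root = ET.fromstring(INVENTORY)


def names(relative_path):
    """Return the text of every element matched below <inventory>."""
    return [el.text for el in root.findall(relative_path)]


print(names('item[@class="cleaning"]/name'))     # ['Soap']
print(names("item[@type]/name"))                 # ['Milk']
# Stand-in for [@class='food' and @type]: both predicates must hold.
print(names("item[@class='food'][@type]/name"))  # ['Milk']
```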

Indices and attribute predicates can be combined. The difference between the following two queries is that query A first selects all <item> elements with class="food", and then takes the second one, while query B selects the second <item> element under the condition that it has class="food" (but in the example it has class="cleaning").

A: /inventory/item[@class="food"][2]/name  # extracts "Milk"
B: /inventory/item[2][@class="food"]/name  # extracts nothing
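The difference between queries A and B comes down to the order in which the filter and the index are applied. The following Python sketch spells out both orders with plain list operations rather than an XPath engine, so that each step is explicit. The function names query_a and query_b are hypothetical.

```python
import xml.etree.ElementTree as ET

INVENTORY = """<inventory>
  <item class="food"><name>Bread</name><amount>12</amount></item>
  <other><msg>not available</msg></other>
  <item class="cleaning"><name>Soap</name><amount>33</amount></item>
  <item class="food" type="perishable"><name>Milk</name><amount>56</amount></item>
</inventory>"""

items = ET.fromstring(INVENTORY).findall("item")  # the three <item> children


def query_a(items):
    """item[@class="food"][2]: filter by class first, then take the second match."""
    food = [it for it in items if it.get("class") == "food"]
    return food[1].findtext("name") if len(food) >= 2 else None


def query_b(items):
    """item[2][@class="food"]: take the second item first, then test its class."""
    if len(items) < 2 or items[1].get("class") != "food":
        return None
    return items[1].findtext("name")


print(query_a(items))  # Milk
print(query_b(items))  # None, because the second <item> has class="cleaning"
```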

Example

Consider the following XML-SOAP message:

<?xml version="1.0" ?>
<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope"
        xmlns:xml="http://www.w3.org/XML/1998/namespace">
  <env:Header>
    <env:Upgrade>
      <env:SupportedEnvelope qname="ns1:Envelope"
          xmlns:ns1="http://www.w3.org/2003/05/soap-envelope"/>
      <env:SupportedEnvelope qname="ns2:Envelope"
          xmlns:ns2="http://schemas.xmlsoap.org/soap/envelope/"/>
    </env:Upgrade>
  </env:Header>
  <env:Body>
    <env:Fault>
      <env:Code>
        <env:Value>env:VersionMismatch</env:Value>
      </env:Code>
      <env:Reason>
        <env:Text xml:lang="en">Version Mismatch</env:Text>
      </env:Reason>
    </env:Fault>
  </env:Body>
</env:Envelope>

The error value env:VersionMismatch can be extracted with the following XPath query:

/env:Envelope/env:Body/env:Fault/env:Code/env:Value
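For comparison, the same value can be read with Python's xml.etree.ElementTree. This sketch uses a shortened copy of the message (header and reason omitted) and a hypothetical helper, fault_code. Be aware of one difference: ElementTree resolves the env: prefix through its namespace URI, so a prefix mapping must be supplied, whereas a query entered in RUEI simply spells out the prefixed names as they appear in the message.

```python
import xml.etree.ElementTree as ET

SOAP = """<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
  <env:Body>
    <env:Fault>
      <env:Code>
        <env:Value>env:VersionMismatch</env:Value>
      </env:Code>
    </env:Fault>
  </env:Body>
</env:Envelope>"""

# Prefix-to-URI mapping used by ElementTree to resolve "env:" in the path.
NS = {"env": "http://www.w3.org/2003/05/soap-envelope"}


def fault_code(xml_text):
    """Return the text of Envelope/Body/Fault/Code/Value, or None if absent."""
    root = ET.fromstring(xml_text)
    if root.tag != "{%s}Envelope" % NS["env"]:
        return None  # the root element must be env:Envelope
    return root.findtext("env:Body/env:Fault/env:Code/env:Value", namespaces=NS)


print(fault_code(SOAP))  # env:VersionMismatch
```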

Important

In order to apply XPath queries to a real-time HTTP data stream, RUEI only supports a limited set of XPath 1.0 functionality. In particular:

All input is regarded as ASCII. Hence, the use of character set encoding (such as UTF-8 and UTF-16) will lead to unreliable results.
References to internal and external files (such as DTDs) within input traffic are ignored.
The descendant-or-self (//) operator is not supported.
XPath expressions can be at most 8 levels deep.
No string within an expression should be a complete substring of any other specified string. Strings have a maximum length of 256 bytes.

In addition, you should be aware of the following:

RUEI assumes that all input traffic is XML. While XHTML is supported, it is interpreted as well-formed XML. Hence, using XPath queries on non-well-formed XML or non-XML traffic can lead to unreliable results.
The use of namespaces and CDATA is not supported. If they appear in the input stream, they are treated literally. This can lead to false matches.
All expressions are resolved as "AND". The use of "OR" and of relational expressions (such as <=, >=, <, and >) is not supported.
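Some of the limits listed above lend themselves to a simple pre-check. The following Python sketch is one possible way to do this, under stated assumptions: every step in the path (including a final @attribute) is counted as one level, and "string" is taken to mean a quoted literal inside a predicate. The function name check_limits is hypothetical, and the substring rule is not checked.

```python
import re

MAX_DEPTH = 8            # maximum number of levels, as stated above
MAX_STRING_BYTES = 256   # maximum string length, as stated above

QUOTED = r"\"[^\"]*\"|'[^']*'"  # a double- or single-quoted literal


def check_limits(path):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    predicates = re.findall(r"\[([^\]]*)\]", path)
    # Assumption: each step of the path counts as one level.
    steps = re.sub(r"\[[^\]]*\]", "", path).strip("/").split("/")
    if len(steps) > MAX_DEPTH:
        problems.append("%d levels exceeds the maximum of %d" % (len(steps), MAX_DEPTH))
    for pred in predicates:
        for literal in re.findall(QUOTED, pred):
            if len(literal[1:-1].encode("utf-8")) > MAX_STRING_BYTES:
                problems.append("string longer than %d bytes" % MAX_STRING_BYTES)
        bare = re.sub(QUOTED, "", pred)  # predicate text without literals
        if re.search(r"\bor\b", bare):
            problems.append("or is not supported")
        if re.search(r"[<>]", bare):
            problems.append("relational operators are not supported")
    return problems


print(check_limits("/inventory/item[@class='food' and @type]/name"))  # []
print(check_limits("/a/b[@x='1' or @y]"))
```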

Using Third-Party XPath Tools

For convenience, you can use third-party XPath tools, such as the XPather extension for Mozilla Firefox, to create XPath expressions for use within RUEI. The XPather extension is available at http://xpath.alephzarro.com/index.

Once the extension is installed, you can right-click within a page and select the Show in XPather option. An example is shown in Figure F-1.

Figure F-1 XPather Tool


You can then copy the XPath expression within the XPather browser (shown in Figure F-2) and use it as the basis for your XPath query within RUEI. Be aware that you should review the generated XPath expression to ensure that it conforms to the restrictions described above.

Figure F-2 XPather Browser
