F Working with XPath Queries

This appendix provides detailed information about the support available within RUEI for the use of XPath queries. XPath queries can be used as part of content message, client identification, and custom dimension definitions, as well as page and service identification definitions.

F.1 Introduction

XPath (XML Path Language) is a query language that can be used to select nodes and compute values from XML documents. Within RUEI, XPath version 1.0 support is based on the libxml2 library. Further information about it is available from the following location:

http://xmlsoft.org/

A complete specification of the XPath language is available at the following location:

http://www.w3.org/TR/xpath

Important

RUEI applies XPath matching to all traffic content, regardless of whether or not it is actually in XML format. Therefore, in order to obtain accurate results, it is strongly recommended that you ensure that all XPath expressions are executed against well-formed XHTML code, and review the information in Section F.5, "Optimizing XPath Scanning". In addition, note that XPath expressions are case sensitive.
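As a quick way to check whether a sample of monitored content is well-formed before writing XPath expressions against it, the content can be run through a strict XML parser. The following is a minimal sketch using Python's standard library rather than anything provided by RUEI; the function name `is_well_formed` and the sample strings are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples: the second one never closes its <p> element.
good = "<html><body><p>Hello</p></body></html>"
bad = "<html><body><p>Hello</body></html>"

print(is_well_formed(good), is_well_formed(bad))
```

A strict parser of this kind rejects content that browsers would silently repair, which makes it a reasonable stand-in for deciding whether an XPath expression is likely to behave predictably.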

F.2 Namespace Support

XML namespaces are used for providing uniquely named elements and attributes in an XML document. An XML instance may contain element or attribute names from more than one XML vocabulary. If each vocabulary is assigned a namespace, then the ambiguity between identically named elements or attributes can be resolved.

Within RUEI, all namespaces used in your XPath queries must be explicitly defined. If a query uses a namespace that has not been defined, the query will not work. To define a namespace, do the following:

Select Configuration, then General, Advanced settings, and then XPath namespaces. The window shown in Figure F-1 appears.

Figure F-1 Example XPath Namespaces

Note that this window can also be reached by clicking the Namespaces tab when specifying an XPath-based content message, custom dimension, user identification, and page and service identification definition.
Click Add new to define a new namespace, or click an existing definition to modify it. The dialog shown in Figure F-2 appears.

Figure F-2 Add XPath Namespace Dialog

Specify the namespace prefix used within the monitored XML documents, and its corresponding base URL. The prefix name must be unique. See Section F.3, "Understanding Namespaces Prefixes and URLs" for important information on the use of namespaces and prefixes. Note that the namespace definitions are global and are applied to all monitored traffic.

When ready, click Save. Any new definition, or modification to an existing definition, takes effect within a 5-minute period.

F.3 Understanding Namespaces Prefixes and URLs

A namespace consists of a prefix (which can be considered as a form of local binding), and a URL. This specifies the location of a document that defines the actual namespace. For example, my_ns1=http://www.w3.org/1999/xhtml.

It is important to understand that a namespace is defined by its URL (or the contents of the document at that location). The prefix merely represents a reference to the namespace URL within XML nodes and elements. Hence, you can bind the same namespace to multiple prefixes, and mix these prefixes in a document. The result is that all elements will be in the same namespace because all prefixes point to the same URL.
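To illustrate that the URL, not the prefix, identifies a namespace, the following sketch parses a small document in which two prefixes are bound to the same URL. It uses Python's standard `xml.etree.ElementTree` module, not RUEI itself; the document, the prefixes `a` and `b`, and the element name `p` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: prefixes "a" and "b" both point to the same URL.
DOC = """
<root xmlns:a="http://www.w3.org/1999/xhtml"
      xmlns:b="http://www.w3.org/1999/xhtml">
    <a:p>first</a:p>
    <b:p>second</b:p>
</root>
"""

root = ET.fromstring(DOC)

# The parser replaces each prefix with the namespace URL it is bound to,
# so both children end up with an identical expanded name.
tags = [child.tag for child in root]
print(tags)

# A third prefix, used only in the query, selects both elements.
ns = {"my_ns1": "http://www.w3.org/1999/xhtml"}
texts = [p.text for p in root.findall("my_ns1:p", ns)]
print(texts)
```

Both elements are reported with the same expanded name, and a query prefix that never appears in the document still selects them.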

The same prefix may be bound to a different namespace in different documents. Moreover, XML allows you to bind the same prefix to different URLs in the same document.

Different Namespaces, Same Prefix

Consider the following example:

<parent xmlns:ns1="foo">
    <child xmlns:ns2="bar">
       <ns2:some_element/>
    </child>
    <child xmlns:ns2="baz">
       <ns2:some_element/>
    </child>
</parent>

Here, the two child nodes use two different namespaces (bar and baz), but bind them locally to the same prefix. Hence, the <some_element> node has a different meaning in each of the two children.

Imagine that you want to select the second <some_element> node. You might consider using the XPath expression //ns2:some_element. However, this will not work, because RUEI cannot determine if the ns2 prefix refers to the first definition (bar) or the second (baz).

Now imagine that you want to match both of these nodes. One possible solution is to ignore the namespace definition altogether, and use a wildcard expression. For example:

//*[local-name()='some_element']

However, this expression would find all <some_element> nodes, even ones bound to namespaces in which you are not interested. It is important to understand that the actual prefix name is irrelevant. Because the prefix is local to the part of the document where it is defined, it does not matter which prefix you use in your XPath expression as long as it is a prefix bound to the namespace that is valid at that location. Hence, the following XPath expression

//ns2:some_element

where ns2 specifies baz would match the second <some_element> node. However, because the actual prefix specified in the XPath expression does not have to match the prefix used in the document, you could also use the following XPath expression:

//boo:some_element

where boo specifies baz. In this case, the XPath expression is used to find a node that is locally bound to the namespace baz, and boo is used to refer to that namespace in the XPath expression. When RUEI loads this expression, it looks up the namespace that boo points to (in this case, baz), and records that it has to find a node called <some_element> inside the namespace baz. At that point, the prefix is no longer required. When RUEI scans a document, each time it finds a <some_element> element, it retrieves the current namespace through the locally defined prefix, and compares that to the namespace defined in the XPath expression. If they are both baz, then a match is found.

In the light of the above, RUEI can be configured to extract both nodes by using the following XPath expressions:

//boo:some_element where boo=bar
//hoo:some_element where hoo=baz
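The matching behavior described above can be tried outside RUEI. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module, which supports only a subset of XPath; it is not RUEI's implementation. The element text ("from bar", "from baz") is added purely so that the matches can be told apart. Because ElementTree does not support local-name(), the sketch uses its `{*}` wildcard (Python 3.8 or later) as the equivalent of the namespace-agnostic expression.

```python
import xml.etree.ElementTree as ET

# The example document, made well-formed, with illustrative text content.
DOC = """
<parent xmlns:ns1="foo">
    <child xmlns:ns2="bar">
        <ns2:some_element>from bar</ns2:some_element>
    </child>
    <child xmlns:ns2="baz">
        <ns2:some_element>from baz</ns2:some_element>
    </child>
</parent>
"""
root = ET.fromstring(DOC)

# Query-side prefixes: the names are arbitrary, only the namespaces matter.
ns = {"boo": "bar", "hoo": "baz"}

bar_hits = [e.text for e in root.findall(".//boo:some_element", ns)]
baz_hits = [e.text for e in root.findall(".//hoo:some_element", ns)]

# Namespace-agnostic match, comparable to //*[local-name()='some_element'].
any_ns = [e.text for e in root.findall(".//{*}some_element")]

print(bar_hits, baz_hits, any_ns)
```

Neither boo nor hoo occurs in the document, yet each selects exactly one node, because the comparison is made on the namespace rather than on the prefix.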

Additional Example

Consider the following XML document:

<ns2:parent xmlns:ns2="foo">
    <ns2:child xmlns:ns2="bar">
       <ns2:some_element/>
    </ns2:child>
</ns2:parent>

In this case, the parent and the child nodes use different namespaces, but both namespaces are bound to the same prefix (ns2). This is legal because namespace definitions are local. To find the content of the <some_element> node, you could use the following XPath expression:

/p_parent:parent/p_child:child/p_child:some_element

where p_parent is foo and p_child is bar (or whatever local prefix strings you want).
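The path expression above can be checked with a short script. This is a sketch under two assumptions: the parent and child elements themselves carry the ns2 prefix (so that they actually belong to the namespaces foo and bar), and the element text "payload" is invented for the example. It uses Python's standard `xml.etree.ElementTree` module, whose paths are evaluated relative to the root element, so the root step is checked separately.

```python
import xml.etree.ElementTree as ET

# Assumed variant of the example: every element is prefixed with ns2,
# which is rebound from "foo" to "bar" on the child element.
DOC = """
<ns2:parent xmlns:ns2="foo">
    <ns2:child xmlns:ns2="bar">
        <ns2:some_element>payload</ns2:some_element>
    </ns2:child>
</ns2:parent>
"""
root = ET.fromstring(DOC)

# Query-side prefixes, independent of the ns2 prefix used in the document.
ns = {"p_parent": "foo", "p_child": "bar"}

# Step 1 of /p_parent:parent/...: the root must be "parent" in namespace foo.
root_matches = root.tag == "{foo}parent"

# Remaining steps, evaluated relative to the root element.
hits = root.findall("p_child:child/p_child:some_element", ns) if root_matches else []
contents = [e.text for e in hits]

print(root_matches, contents)
```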

F.4 Using Third-Party XPath Tools

For convenience, you can use third-party XPath tools, such as the XPather extension for Mozilla Firefox, to create XPath expressions for use within RUEI. The XPather extension is available at the following location:

http://xpath.alephzarro.com/index

Once XPather is installed, you can right-click within a page and select the Show in XPather option. An example is shown in Figure F-3.

Figure F-3 XPather Tool

You can then copy the XPath expression within the XPather browser (shown in Figure F-4) and use it as the basis for your XPath query within RUEI. Be aware that you should review the generated XPath expression to ensure that it conforms to the restrictions described above.

Figure F-4 XPather Browser

Note:

If the underlying XHTML code within the page is not well-formed, the XPath expression generated by XPather may not function correctly within RUEI.

F.5 Optimizing XPath Scanning

This section describes how you can optimize XML scanning, and prevent the unnecessary scanning of large numbers of documents, which can lead to excessive memory and CPU usage on a Collector system.

RUEI determines which documents are scanned for XPath definitions in the following way:

If a document is a forced object, it is not scanned. See Section 12.15, "Controlling the Reporting of Objects as Pages" for information about the file extensions regarded as forced objects, and how this can be configured.
Otherwise, if the document's Content-Type header specifies HTML or XML, it is scanned as either an HTML or XML file.
If the document's Content-Type is something other than html or xml, or the document does not contain a Content-Type header, the start of its message content is compared to predefined content strings and, if a match is found, the document will be scanned for HTML or XML content.
Finally, if the predefined content strings do not generate a match, but a default content type (HTML or XML) has been configured, then the document is scanned as though it is the specified default type.
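The order of these checks can be summarized as a small decision function. The following is only a sketch of the logic as described above, not RUEI's actual implementation. The function name, its parameters, and the way a Content-Type value is tested for "html" or "xml" are all assumptions made for illustration; `sniffed_type` stands for the outcome of the content-string comparison (None when nothing matched).

```python
from typing import Optional


def decide_scan_type(is_forced_object: bool,
                     content_type: Optional[str],
                     sniffed_type: Optional[str],
                     default_type: Optional[str]) -> Optional[str]:
    """Return "html", "xml", or None (document is not scanned)."""
    # 1. Forced objects are never scanned.
    if is_forced_object:
        return None

    # 2. A Content-Type header that specifies HTML or XML decides the type.
    #    Assumption: a substring test on the media type, HTML checked first.
    if content_type:
        media_type = content_type.split(";")[0].strip().lower()
        if "html" in media_type:
            return "html"
        if "xml" in media_type:
            return "xml"

    # 3. Otherwise, fall back on the content-string comparison.
    if sniffed_type in ("html", "xml"):
        return sniffed_type

    # 4. Finally, use the configured default content type, if there is one.
    return default_type


print(decide_scan_type(False, "text/xml; charset=utf-8", None, None))
```

Written this way, it is easy to see that a document with an unrecognized Content-Type, no matching content string, and no configured default is not scanned at all.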

XML/HTML Content Strings

When the Content-Type header is missing from a document, or specifies something other than HTML or XML, the start of its message content is compared to predefined strings and, if a match is found, it is scanned as though it were an HTML or XML document. The default content strings are shown in Table F-1.

Table F-1 Default XML/HTML Content Strings

XML	HTML
`<?xml`	`<html`
	`<head`
	`<!DOCTYPE`

Note that XML content strings are case sensitive, while HTML strings are not.

Additional content strings can be defined by issuing the following command as the RUEI_USER user on the Collector system:

execsql config_set_prj_value wg xpath-options type add "content-string"

where:

type specifies whether the document should be scanned as HTML (html-magic) or XML (xml-magic).
content-string specifies the string against which the start of documents should be compared.

For example:

execsql config_set_prj_value wg xpath-options xml-magic add "<soap:"
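The effect of the content strings, including an additional one such as "<soap:", can be sketched as follows. This is an illustration of the comparison rules described in this section (XML strings case sensitive, HTML strings not), not RUEI's code. The function name, the strict start-of-content comparison without whitespace skipping, and the choice to test XML strings before HTML strings are assumptions.

```python
from typing import List, Optional

# Default content strings from Table F-1.
XML_STRINGS = ["<?xml"]
HTML_STRINGS = ["<html", "<head", "<!DOCTYPE"]


def sniff_content_type(body: str,
                       xml_strings: List[str] = XML_STRINGS,
                       html_strings: List[str] = HTML_STRINGS) -> Optional[str]:
    """Return "xml", "html", or None based on the start of the content."""
    # XML content strings are compared case sensitively.
    for content_string in xml_strings:
        if body.startswith(content_string):
            return "xml"

    # HTML content strings are compared case insensitively.
    lowered = body.lower()
    for content_string in html_strings:
        if lowered.startswith(content_string.lower()):
            return "html"

    return None


# With the defaults, a SOAP envelope without an XML declaration is missed.
print(sniff_content_type("<soap:Envelope>"))

# After adding "<soap:" as an XML content string, it is recognized.
print(sniff_content_type("<soap:Envelope>", xml_strings=XML_STRINGS + ["<soap:"]))
```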

Default Content Type

If the defined content strings do not generate a match, a default content type can be assigned to these documents to ensure that they are scanned. To do so, issue the following command as the RUEI_USER user on the Collector system:

execsql config_set_prj_value wg xpath-options default-content-type add "type"

where type is either xml or html.