G Working with National Language Support

This appendix provides a detailed discussion of the character encoding standards supported by RUEI when monitoring network traffic. Restrictions on the identification of such things as domain names, custom headers, and functional errors are highlighted. The operation of data masking and user ID matching when working with international character sets is also discussed.

G.1 Introduction

Collectors can monitor network traffic containing data in a wide variety of encoding standards. A complete list of the encoding standards currently supported by RUEI is shown in Table G-1.

Table G-1 Supported Encodings

Canonical Name  MIME Name (Footnote 1)          Description

Big5            Big5                            Traditional Chinese.
EUC-JP          EUC-JP                          EUC-encoding Japanese.
GB_2312-80      GB_2312-80, gb2312, chinese     Chinese.
GBK             GBK, CP936, MS936, windows-936  Simplified Chinese.
ISO-8859-1      ISO-8859-1, ISO_8859-1, latin1  Latin alphabet no. 1.
ISO-8859-10     ISO-8859-10, latin6             Latin alphabet no. 6 (Nordic).
ISO-8859-13     ISO-8859-13                     Latin alphabet no. 7 (Baltic Rim).
ISO-8859-14     ISO-8859-14, latin8             Latin alphabet no. 8 (Celtic).
ISO-8859-15     ISO-8859-15, latin9             Latin alphabet no. 9.
ISO-8859-16     ISO-8859-16, latin10            Latin alphabet no. 10 (south-eastern Europe).
ISO-8859-2      ISO-8859-2, ISO_8859-2, latin2  Latin alphabet no. 2 (central and eastern Europe).
ISO-8859-3      ISO-8859-3, latin3              Latin alphabet no. 3 (southern Europe).
ISO-8859-4      ISO-8859-4, latin4              Latin alphabet no. 4 (northern Europe).
ISO-8859-5      ISO-8859-5, cyrillic            Cyrillic.
ISO-8859-6      ISO-8859-6, arabic              Arabic.
ISO-8859-7      ISO-8859-7, greek               Greek.
ISO-8859-8      ISO-8859-8, hebrew              Hebrew.
ISO-8859-9      ISO-8859-9, latin5              Latin alphabet no. 5 (Turkish).
KOI8-R          KOI8-R                          Russian.
Shift_JIS       Shift_JIS, shift-JIS            Japanese.
US-ASCII        US-ASCII, ascii                 American Standard Code for Information Interchange (ASCII).
UTF-32          UTF-32                          32-bit UCS transformation format. Also known as UCS-4.
UTF-16          UTF-16                          16-bit UCS transformation format, byte order identified by an optional byte-order mark.
UTF-16BE        UTF16BE                         16-bit Unicode transformation format, big-endian byte order.
UTF-16LE        UTF16LE                         16-bit Unicode transformation format, little-endian byte order.
UTF-32BE        UTF32BE                         32-bit Unicode transformation format, big-endian byte order.
UTF-32LE        UTF32LE                         32-bit Unicode transformation format, little-endian byte order.
UTF-8           UTF-8                           8-bit UCS transformation format.
windows-1250    windows-1250                    Microsoft Windows Eastern European.
windows-1251    windows-1251                    Microsoft Windows Cyrillic (Russian).
windows-1252    windows-1252                    Microsoft Windows Latin.
windows-1253    windows-1253                    Microsoft Windows Greek.
windows-1254    windows-1254                    Microsoft Windows Turkish.
windows-1255    windows-1255                    Microsoft Windows Hebrew.
windows-1256    windows-1256                    Microsoft Windows Arabic.
windows-1257    windows-1257                    Microsoft Windows Baltic.
windows-1258    windows-1258                    Microsoft Windows Vietnamese.


Footnote 1: The name (and supported aliases) as recognized in HTTP encoding declarations.

Note that vendor-specific web site encoding may not be supported. Network traffic containing non-supported encoding is still recorded, but matching may not be possible. For example, the content of a page can still be viewed in the Replay Viewer, but the page's defined name may not be correctly associated with it.

Web Site Configuration

To correctly monitor a web site that uses international text, it is essential that the web site is properly configured. For example, if its web server advertises UTF-8, but the actual pages are not UTF-8 encoded, RUEI cannot correctly monitor them, even when some web browsers can autodetect and correct the unsupported contents. Therefore, such things as functional error and content checks will not operate correctly for these pages.
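
As a rough illustration of the mismatch described above, the following Python sketch checks whether a response body actually decodes under the charset advertised in its Content-Type header. The function name and the sample header are hypothetical and are not part of RUEI. A strict decode is used here only as a simple stand-in for whatever validation a monitoring tool performs.

```python
from email.message import Message


def body_matches_advertised_charset(content_type: str, body: bytes) -> bool:
    """Return True if body decodes cleanly under the advertised charset.

    Hypothetical helper for illustration; not a RUEI API.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # e.g. "utf-8", or None if absent
    if charset is None:
        return False
    try:
        body.decode(charset)  # strict decoding raises on invalid byte sequences
    except (UnicodeDecodeError, LookupError):
        # LookupError: Python does not recognize the advertised charset name
        return False
    return True


header = "text/html; charset=UTF-8"
print(body_matches_advertised_charset(header, "café".encode("utf-8")))       # True
print(body_matches_advertised_charset(header, "café".encode("iso-8859-1")))  # False
```

A clean decode does not prove that the advertised encoding is right, because single-byte encodings such as ISO-8859-1 accept almost any byte sequence. A failed decode, however, is a reliable sign that the web server is misconfigured.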

G.2 Implementation Considerations

Data Masking

Collectors can be configured to omit the logging of sensitive information. This is described in Section 13.6, "Masking User Information". Only ASCII argument names are supported. The encoding used in the argument's content does not matter because it is replaced anyway.

Particular attention should be paid to variable names that contain a dollar ($) character. For example, foo$bar can be transmitted in monitored traffic as foo%24bar (this is browser dependent). In this case, to mask this variable correctly, the percent-encoded variable name should be specified.

Be aware that the variables to be masked must be specified in ASCII format, and be specified exactly as they are reported within session diagnostics. For example, the variable name user name would be reported with session diagnostics as user%20name, but can also appear as user+name. Hence, both variable names should be specified for masking.
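
The following Python sketch lists the forms under which an ASCII argument name might appear in monitored traffic, using the standard library's percent-encoding functions. The helper name `masking_name_variants` is hypothetical, and the list it produces is only a starting point: the name reported in session diagnostics remains the authoritative one.

```python
from urllib.parse import quote, quote_plus


def masking_name_variants(name: str) -> list:
    """List the ASCII forms under which an argument name may be reported.

    Hypothetical helper; the form actually sent depends on the browser.
    """
    if not name.isascii():
        raise ValueError("non-ASCII name: check session diagnostics instead")
    candidates = [
        name,                       # sent as-is by some browsers
        quote(name, safe=""),       # space -> %20, $ -> %24
        quote_plus(name, safe=""),  # space -> +,   $ -> %24
    ]
    variants = []
    for candidate in candidates:
        # A literal space cannot appear in a transmitted argument name
        if " " not in candidate and candidate not in variants:
            variants.append(candidate)
    return variants


print(masking_name_variants("foo$bar"))    # ['foo$bar', 'foo%24bar']
print(masking_name_variants("user name"))  # ['user%20name', 'user+name']
```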

If the argument name contains non-ASCII characters, you should use session diagnostics (described in Chapter 4, "Working With the Session Diagnostics Feature") to see how it is reported, and specify this reported name as the variable to be masked. In addition, you should regularly check the log files to ensure the data is being correctly masked.

Note that the restrictions and requirements described above for masking URL arguments also apply to any situation in which you want direct access to a URL argument, such as custom dimensions or application definitions.

Note:

HTML form field names (not values) should be in ASCII format to ensure that they are correctly masked.

Custom Headers and Cookies

All header names must be encoded in ASCII because this is required by the HTTP protocol. Within header contents, all non-ASCII characters are replaced by a placeholder.
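
The effect on header contents can be pictured with the short Python sketch below. The placeholder character used here is an assumption made for illustration. This appendix does not state which placeholder RUEI uses.

```python
import re

PLACEHOLDER = "?"  # assumption: the real placeholder character may differ


def ascii_only_header_value(value: str) -> str:
    """Replace every non-ASCII character in a header value with a placeholder."""
    return re.sub(r"[^\x00-\x7F]", PLACEHOLDER, value)


print(ascii_only_header_value("Müller GmbH"))  # M?ller GmbH
```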

User ID Matching

Within RUEI, user identification is first based on the HTTP Authorization field. If this is not found, the application's user identification scheme is used. This can be specified in terms of URLs, cookies, request or response headers, or XPath expressions. This is explained in Section 7.3.14, "Defining User Identification".

Because a URL argument is a name=value combination, the name part is specified as the source argument from which the user ID will be read. The value part is extracted and reported as the user ID. The specified source argument is subject to the same requirements as explained earlier for data masking. However, the value part of the combination can be specified in any supported encoding. RUEI attempts to translate the value from its native encoding (for example, Shift-JIS) to UTF-8 so that it can be rendered within the user interface in the native language (for example, Japanese).
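
The translation step described above can be sketched in Python as follows. This is a minimal illustration, assuming the native encoding is already known. The function name `user_id_to_utf8` is hypothetical and does not correspond to any RUEI interface.

```python
from urllib.parse import quote, unquote_to_bytes


def user_id_to_utf8(raw_value: str, native_encoding: str) -> bytes:
    """Percent-decode a URL argument value and transcode it to UTF-8."""
    raw_bytes = unquote_to_bytes(raw_value)   # undo percent-encoding only
    text = raw_bytes.decode(native_encoding)  # interpret bytes in the native encoding
    return text.encode("utf-8")


# Build a sample value the way a Shift_JIS page might submit it
sample = quote("山田".encode("shift_jis"))
utf8_bytes = user_id_to_utf8(sample, "shift_jis")
print(utf8_bytes.decode("utf-8"))  # 山田
```

If the wrong native encoding is supplied, the decode step either raises an error or silently produces the wrong characters, which corresponds to the garbled values discussed next.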

However, when the native encoding of the value is not known, the user ID cannot be properly rendered within the user interface, and the reported value is garbled. Due to the limitations of the HTTP protocol, user IDs on some web sites may not be rendered as expected. In that case, it is recommended that you specify the Collector encoding that should be used. This is explained in Section G.4, "Specifying the URL Argument/Collector Encoding". Note that the encoding specified for this setting is only applicable to URL and POST arguments. Content-based reporting (for example, functional errors) is not affected by this setting. Because this does not guarantee the correct rendering of all values, you should also review the web site definitions, and verify that all user IDs are ASCII only.

G.3 Specifying Content Checks

Be aware that, when specifying page content checks, the content rendered within the client browser (and seen by the end user) may differ from the underlying HTML page source. This is because of underlying font, format, and link tags, as well as entity definitions, and so on. Hence, simply copying and pasting a portion of text from the rendered page within a client browser may not always work as expected.
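
The difference between rendered text and page source can be seen in the small Python sketch below. The HTML fragment is invented for illustration, and the tag stripping is a crude approximation of browser rendering rather than a description of how RUEI evaluates content checks.

```python
import html
import re

source = "<p>Caf&eacute; <b>open</b> &amp; serving</p>"

# Approximate what the browser shows: strip tags, then resolve entities
rendered = html.unescape(re.sub(r"<[^>]+>", "", source))

print(rendered)            # Café open & serving
print(rendered in source)  # False
```

Text copied from the rendered page does not occur anywhere in the source, so a content check based on it would not match.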

Normally, this problem can be overcome by copying and pasting from the View source facility within the client browser. However, for pages that use an encoding other than UTF-8, this approach does not work if you are using Internet Explorer 6 or 7. The reason for this is that IE uses Notepad as its source viewer, and this only supports UTF-8. As a result, the source may appear garbled, and cannot meaningfully be copied and pasted into RUEI.

Because Mozilla Firefox employs an internal HTML source rendering tool, it is always able to render the HTML source accurately, even for non-UTF-8 encodings. Therefore, it is recommended you use this browser as the basis for content-based checks, and whenever an accurate rendition of the HTML source is required.

G.4 Specifying the URL Argument/Collector Encoding

In order for RUEI to correctly report on monitored network traffic, it must understand the encoding used within that traffic. RUEI can monitor network traffic containing data in a wide variety of character encoding standards. Table G-1 provides a complete list of the encoding standards supported by RUEI.

Generally speaking, RUEI first attempts to use the document encoding specified for the corresponding HTML document (so-called auto-detection). If this fails to produce a satisfactory result, the Collector encoding (if specified) is used to decode URL and POSTed form arguments.

Be aware that the Collector encoding is not a manual override to the document encoding. Rather, it specifies the encoding that RUEI should attempt to use once the document encoding has failed to satisfactorily decode the URL arguments. If the Collector encoding also fails to produce a satisfactory result, the arguments are reported in their original (non-decoded) format.
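
The fallback order described above can be sketched in Python as follows. The function name `decode_argument` is hypothetical. The sketch also assumes that a "satisfactory" result simply means a strict decode without errors, which may be looser or stricter than the criteria RUEI actually applies.

```python
from typing import Optional
from urllib.parse import quote, unquote_to_bytes


def decode_argument(raw: str, document_encoding: Optional[str],
                    collector_encoding: Optional[str]) -> str:
    """Try the document encoding, then the Collector encoding, else return raw."""
    raw_bytes = unquote_to_bytes(raw)
    for encoding in (document_encoding, collector_encoding):
        if encoding is None:
            continue
        try:
            return raw_bytes.decode(encoding)  # strict: fails on invalid bytes
        except (UnicodeDecodeError, LookupError):
            continue
    return raw  # neither encoding worked: report the original, non-decoded form


sjis_value = quote("日".encode("shift_jis"))
print(decode_argument("%E3%81%82", "utf-8", None))        # あ (document encoding works)
print(decode_argument(sjis_value, "utf-8", "shift_jis"))  # 日 (falls back to Collector encoding)
print(decode_argument("%FF", "utf-8", "ascii"))           # %FF (both fail)
```

Because the Collector encoding is consulted only after the document encoding has failed, it never overrides a document encoding that decodes successfully.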

URL Argument and Collector Encoding

To specify the URL argument and Collector encoding, do the following:

  1. Select Configuration, then Security, and then Collector encoding. The panel shown in Figure G-1 appears.

    Figure G-1 Collector Encoding


  2. Click the currently defined Collector encoding for the required Collector profile. By default, no Collector encoding is defined. The dialog shown in Figure G-2 appears.

    Figure G-2 Edit Collector Encoding Dialog


  3. Use the Collector encoding menu to specify the encoding to be used by Collectors within the selected Collector profile for URL arguments within application filters, and when auto-detection fails. The list of available encodings is equivalent to that shown in Table G-1.

    When ready, click Save. Any change you make to this setting takes effect almost immediately.

Important

When using this facility, you should pay particular attention to the following points:

  • This setting is only applicable to the decoding of URL arguments within application definitions (see Section 7.3, "Defining Applications"). Content-based reporting (for example, functional errors) is not affected by this setting. In addition, the selected Collector encoding applies across all applications, pages, and domains monitored by the selected profile's Collectors.

  • If you are using international character sets within your web sites, it is strongly recommended that you carefully review your web site content, and the encodings used for it. In addition, you should regularly review the reporting of full URL arguments to ensure that they are correct.