Code Set Conversion

Support for code set conversion, or character set (charset) conversion, is an essential part of the operating system, because most applications rely on this capability to function properly.

Oracle Solaris 11 includes various tools and libraries for code set conversion. The core code set conversion utility, iconv, is built around the iconv library in Oracle Solaris libc.

Oracle Solaris also includes the International Components for Unicode (ICU), a widely used set of libraries and tools for Unicode support, software internationalization, and software globalization.

`iconv` Utility

The iconv(1) command-line utility converts characters or sequences of characters from one code set to another. It supports a wide range of code sets. Because code set names often differ among platforms, many of the code sets are supported under multiple names thanks to an aliasing mechanism in iconv. Run the following command to obtain the list of code sets currently available in a system:

$ /usr/bin/iconv -l
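Because the list can be long, it is often filtered. The following is a minimal sketch, assuming a POSIX shell with grep and sed available; the helper name has_codeset is hypothetical, and the match is a plain substring search, so the exact names and layout of the list on your system might give different results.

```shell
# has_codeset NAME: succeed if NAME appears (case-insensitively) anywhere
# in the output of iconv -l. This is a substring match, not an exact match.
has_codeset() {
    iconv -l | grep -i -- "$1" > /dev/null
}

if has_codeset 'UTF-8'; then
    echo "UTF-8 is listed"
fi

# Show the first few listed lines that mention "8859".
iconv -l | grep -i '8859' | sed -n '1,5p'
```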

Because multiple packages have iconv modules, you can extend the default list by installing additional packages. The default installation includes the system/library/iconv/utf-8 package, which covers the basic set of iconv modules for conversions among UTF-8 and other Unicode code sets and selected other code sets. Other packages are available in the System/Internationalization category in the Package Manager, or by using the system/library/iconv/* name pattern for installation with the pkg(1) command.

The -f option defines the source code set and the -t option defines the target code set. You can use iconv to convert a file, or standard input, and write the result to standard output as follows:

$ /usr/bin/iconv -f eucJP -t UTF-8 file.txt

This example converts file.txt from the eucJP code set (Extended UNIX Code Packed Format for Japanese) to UTF-8 and writes the result to standard output.
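Conversion from standard input works the same way. The following sketch assumes that the names UTF-8 and ISO8859-1 are both accepted by iconv on your system (names can differ among platforms); the function name to_latin1 and the output file name are hypothetical.

```shell
# to_latin1: read UTF-8 text on standard input and write ISO8859-1
# to standard output.
to_latin1() {
    iconv -f UTF-8 -t ISO8859-1
}

# "café" in UTF-8: the é is the two bytes 0xC3 0xA9, written here as
# octal escapes because they are portable across printf implementations.
printf 'caf\303\251\n' | to_latin1 > cafe_latin1.txt

# Dump the result byte by byte; é should now be the single byte 0xE9.
od -An -tx1 cafe_latin1.txt
```

Redirecting standard output, as shown, is the usual way to keep the converted text in a file.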

In Oracle Solaris 11, iconv is extended to include flags that modify the behavior of the conversion in these special situations:

Illegal character – The input character is not valid in the declared source code set
Non-identical character – There is no matching character in the target code set

Flags like //ILLEGAL_DISCARD, //NON_IDENTICAL_DISCARD, //IGNORE and //TRANSLIT can also be used at the command line. For more information, see the iconv_open(3C) man page.

Note - Some of the iconv modules in Oracle Solaris might implement only a subset of the flags described in the iconv_open(3C) man page.
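The following sketch illustrates the illegal character case. It assumes that a flag is written as a suffix of the target code set name, as described in iconv_open(3C), and that the module in use implements //IGNORE; the file names are hypothetical. Check the exit status rather than relying on any particular output, because behavior can differ among modules.

```shell
# A lone 0xFF byte is never valid in UTF-8, so this file contains an
# illegal character for the declared source code set.
printf 'ab\377cd\n' > bad_utf8.txt

# Without flags, iconv is expected to fail with a non-zero exit status.
if iconv -f UTF-8 -t UTF-16 bad_utf8.txt > /dev/null 2>&1; then
    strict_status=ok
else
    strict_status=failed
fi
echo "strict conversion: $strict_status"

# Assumed syntax: the flag is appended to the target code set name.
# Some implementations still exit non-zero after skipping characters.
iconv -f UTF-8 -t ISO8859-1//IGNORE bad_utf8.txt > cleaned.txt 2> /dev/null \
    || echo "iconv reported skipped characters or an unsupported flag"
```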

For more information about iconv, see the iconv(1), iconv(3C), iconv_open(3C), and related man pages.

International Components for Unicode

Oracle Solaris 11 adds the International Components for Unicode (ICU) C/C++ libraries to the available interfaces.

About ICU

ICU is a mature, widely used set of libraries providing Unicode and globalization support for software applications. ICU is portable and gives applications the same results on all platforms and between C/C++ and Java software.

Some of the services provided by ICU include:

Code Page Conversion – Convert text data to or from Unicode and nearly any other character set or encoding.
Collation – Compare strings according to the conventions and standards of a particular language, region, or country.
Formatting – Format numbers, dates, times, and currency amounts according to chosen locale.
Time Calculations – Multiple types of calendars and a thorough set of timezone calculation APIs are provided.
Unicode Support – ICU closely tracks the Unicode standard, providing easy access to all of the many Unicode character properties, Unicode normalization, case folding, and other fundamental operations as specified by the Unicode Standard.
Regular Expression – ICU regular expressions fully support Unicode while providing very competitive performance.
Bidirectional text (Bidi) – Support for handling text containing a mixture of left-to-right and right-to-left data.
Text Boundaries – Locate the positions of words, sentences, and paragraphs within a range of text, or identify locations that would be suitable for line wrapping when displaying the text.

ICU on Oracle Solaris 11 is split into two packages: library/icu contains just the libraries, while developer/icu delivers header files and several utilities like uconv(1).

For more information, see the project's web site at http://site.icu-project.org. The libicui18n(3LIB), libicuio(3LIB), libicudata(3LIB), libicule(3LIB), libiculx(3LIB), libicutu(3LIB), and libicuuc(3LIB) man pages document how to use the libraries in Oracle Solaris.

`uconv` Utility

In addition to iconv(1), the uconv(1) command that is a part of the International Components for Unicode (ICU) toolset can also be used to convert text from one encoding to another. uconv supports 229 encodings and more than 1000 aliases.

The tool is part of the developer/icu package, which is not installed by default. To install it, issue the following command:

$ pkg install developer/icu

To convert text in the cp1252 encoding to UTF-8, type the following command:

$ uconv -f cp1252 -t UTF-8 -o file_in_utf8.txt file_in_cp1252_encoding.txt

Another feature of uconv is transliteration – the conversion of letters from one script to another without translating the underlying words. The following example transliterates "Solaris" in Greek characters to "Solaris" in Latin characters:

$ echo "Σολαρις" | uconv -x Greek-Latin -f utf-8 -t utf-8
Solaris
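Transliteration is not limited to changing scripts. As a hedged illustration, the sketch below uses the Latin-ASCII transliterator, which ships with current ICU releases but is assumed rather than confirmed for the ICU version on your system. The guard simply skips the conversion when uconv is not installed.

```shell
if command -v uconv > /dev/null 2>&1; then
    # "café" in UTF-8, with é written as the octal escapes \303\251.
    # Latin-ASCII replaces accented Latin letters with plain ASCII ones.
    translit=$(printf 'caf\303\251\n' | uconv -f utf-8 -t utf-8 -x Latin-ASCII)
    echo "$translit"
else
    echo "uconv not found; install the developer/icu package first"
fi
```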

For more information about this tool's features, see the uconv(1) man page.

File Examiner (`fsexam`)

The File Encoding Examiner fsexam utility enables you to convert the name of a file, or the contents of a plain text file, from a legacy character encoding to UTF-8 encoding. The fsexam utility includes the following new features:

Encoding list customization
Encoding auto-detection
Support for dry runs, log files, batch conversion, file filtering, symbolic links, command line, and special file types like compressed files

To add fsexam to your system, install the storage/fsexam package. For more information, see the fsexam(1) and fsexam(4) man pages.

Auto Encoding Finder (`auto_ef`)

Oracle Solaris includes auto_ef(1), a command-line utility that identifies the encoding of a file. auto_ef judges the encoding by attempting iconv code conversions and checking whether each conversion of the file succeeds. It also performs frequency analysis on the character sequences that appear in the file. For example:

$ auto_ef test_file
eucJP

With the -a option, it displays all possible encodings for the given file:

$ auto_ef -a test_file
eucJP           0.89
zh_CN.euc       0.40
ko_KR.euc       0.01

To add auto_ef to your system, install the text/auto_ef package. For more information, see the auto_ef(1) man page.
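The detected encoding can be passed to iconv. The following is a minimal sketch, assuming that the name printed by auto_ef without -a is one that iconv accepts as a source code set; the function name to_utf8_auto is hypothetical. Because detection is heuristic, review the result before discarding the original file.

```shell
# to_utf8_auto FILE: guess the encoding of FILE with auto_ef, then
# convert FILE to UTF-8 on standard output.
to_utf8_auto() {
    file=$1
    if ! command -v auto_ef > /dev/null 2>&1; then
        echo "auto_ef not found; install the text/auto_ef package" >&2
        return 1
    fi
    # Without -a, auto_ef prints a single encoding name such as eucJP.
    enc=$(auto_ef "$file") || return 1
    [ -n "$enc" ] || return 1
    iconv -f "$enc" -t UTF-8 "$file"
}

# Example call; test_file is the sample file name used above.
# to_utf8_auto test_file > test_file.utf8
```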

International Language Environments Guide for Oracle® Solaris 11.3