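JFP Developer's Guide

Chapter 3 Japanese Locales and Character Classification

This chapter introduces the internationalized character mapping and classification programming interfaces specified by XPG, together with the character mapping and classification interfaces that Solaris adds for Japanese locales, and explains how Japanese characters are classified.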

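Overview

XPG defines a fixed set of character classes for the LC_CTYPE category of each locale, which helps application programmers classify the characters they process (for example, deciding whether a character is an uppercase letter). Because XPG also defines the uppercase/lowercase mapping mechanism within LC_CTYPE, case conversion is straightforward through APIs such as toupper() and tolower().

The character classification and character mapping APIs that XPG specifies come in two forms: one that works on single bytes and one that works on wide characters.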
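
As a minimal sketch of the two forms, the following C fragment wraps the byte-oriented toupper() and the wide-character towupper(). The wrapper names upper_byte() and upper_wide() are illustrative only and are not part of any API.

```c
#include <ctype.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Byte-oriented form. toupper() requires a value representable as
 * unsigned char (or EOF), so the argument is cast before the call.
 */
int
upper_byte(int c)
{
        return (toupper((unsigned char)c));
}

/* Wide-character form: works on one wide character at a time. */
wint_t
upper_wide(wint_t wc)
{
        return (towupper(wc));
}
```

In the default "C" locale both wrappers map a to A. In other locales the result depends on the LC_CTYPE definition in effect.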

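So that characters used in Japanese text (such as kanji) can be handled through the same kind of API, Solaris adds character classes and character mappings that are specific to Japanese locales. To use these mappings and classifications, convert each character to its wide-character representation and then call the API. For the definitions of the Japanese locales that JFP provides on Solaris, see Chapter 2 of the JFP User's Guide.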
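
The convert-then-call pattern described above might look like the following sketch, which assumes the caller has already selected a locale with setlocale(). The helper name upper_from_multibyte() is hypothetical; mbstowcs() and towupper() are the standard functions.

```c
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Convert the multibyte string src to wide characters in dst
 * (capacity dstlen, including the terminator), then apply a
 * wide-character mapping to each element.
 * Returns the number of wide characters stored, or (size_t)-1 if
 * dstlen is 0 or src is not valid in the current locale.
 */
size_t
upper_from_multibyte(const char *src, wchar_t *dst, size_t dstlen)
{
        size_t n, i;

        if (dstlen == 0)
                return ((size_t)-1);
        n = mbstowcs(dst, src, dstlen - 1);
        if (n == (size_t)-1)
                return (n);
        dst[n] = L'\0';         /* mbstowcs() may not terminate a full buffer */
        for (i = 0; i < n; i++)
                dst[i] = (wchar_t)towupper((wint_t)dst[i]);
        return (n);
}
```

The same shape applies to any of the wide-character APIs in this chapter: only the call inside the loop changes.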

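Japanese Character Mapping

Of the APIs used for character mapping, Table 3-1 and Table 3-2 list those that are useful for processing the character sets of Japanese locales. Other APIs are also available. For details, see the manual pages, such as towctrans(3C), wctrans(3C), and wctrans_ja(3C).

Table 3-1 Character Mapping APIs (Part 1)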


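Interface specified by XPG	Description
`towupper(wc)`	Returns the uppercase wide character that corresponds to the wide character wc. If there is none, returns wc unchanged.
`towlower(wc)`	Returns the lowercase wide character that corresponds to the wide character wc. If there is none, returns wc unchanged.
`towctrans(wc, wctrans("tag-name"))`	Returns the wide character that corresponds to wc under the mapping named by the tag. If there is none, returns wc unchanged.
`wctrans("tag-name")`	Returns the value that corresponds to the tag name, for use with `towctrans()`. The following tag names can be used: `"tolower"` `"toupper"`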
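
A small sketch of how wctrans() and towctrans() fit together, using only the two standard tag names from the table. The helper name map_by_tag() is made up for illustration. The check for a zero descriptor reflects the standard behavior of wctrans(), which returns 0 when the tag is not defined in the current locale.

```c
#include <wchar.h>
#include <wctype.h>

/*
 * Map wc through the mapping named by tag.
 * If the current locale does not define the tag, wc is returned
 * unchanged instead of passing an invalid descriptor to towctrans().
 */
wint_t
map_by_tag(wint_t wc, const char *tag)
{
        wctrans_t desc = wctrans(tag);

        if (desc == (wctrans_t)0)
                return (wc);
        return (towctrans(wc, desc));
}
```

With this helper, map_by_tag(wc, "toupper") behaves like towupper(wc), and map_by_tag(wc, "tolower") behaves like towlower(wc).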

Table 3-2 Character Mapping APIs (Part 2)


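Interface extended in Japanese Solaris	Description
`wctrans("tag-name")`	Returns the value that corresponds to the tag name, for use with `towctrans()`. In the Solaris Japanese locales, the following tag names can be used: `"tojhira"` `"tojkata"` `"tojisx0208"` `"tojisx0201"`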
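
Because the Japanese tag names exist only in locales that define them, a program may want to find out whether a tag is available before converting a whole string. The following is a sketch under that assumption; wcs_map_tag() is a hypothetical helper, and the descriptor is looked up once and reused for every character.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Apply the mapping named by tag to every character of s, in place.
 * Returns 0 on success, or -1 (leaving s untouched) if the current
 * locale does not define the tag.
 */
int
wcs_map_tag(wchar_t *s, const char *tag)
{
        wctrans_t desc = wctrans(tag);
        size_t i;

        if (desc == (wctrans_t)0)
                return (-1);
        for (i = 0; s[i] != L'\0'; i++)
                s[i] = (wchar_t)towctrans((wint_t)s[i], desc);
        return (0);
}
```

In a Solaris Japanese locale a call such as wcs_map_tag(buf, "tojkata") would be expected to succeed. In a locale without that tag the function reports -1, and the caller can fall back to leaving the text as is.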

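Program Examples

This section presents two example programs that use the Japanese character mapping APIs. To use these APIs, include the wchar.h and wctype.h header files, and call setlocale() at the start of processing so that the program runs in the appropriate locale.

Example 3-1 is a filter that converts the JIS X 0201 kana characters (half-width katakana) in its input to JIS X 0208 kana characters. Some existing applications and networks cannot carry data that contains JIS X 0201 kana, because of either conventions or implementation limits. For such environments the input must be preprocessed before transmission, for example by converting JIS X 0201 kana to JIS X 0208 kana. The example reads its input one line at a time as a wide-character string and converts each wide character with towctrans().

Example 3-1 Japanese Character Mapping API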

sun% cat my_kanato208.c

/*
 * Read lines from a file and convert JIS X 0201 kana
 * characters to the corresponding characters in the
 * JIS X 0208 set. Processing stops when the input
 * reaches EOF. Each line is assumed to be at most
 * BUFSIZ - 1 wide characters long.
 *
 * The actual processing is done by my_kanato208(), which
 * does the following:
 *      1.      Get the length of the wide string.
 *      2.      Convert each wide char, from the start
 *              of the string, by applying towctrans().
 *              (towctrans() returns its argument unchanged
 *              if there is no corresponding char.)
 *      3.      Write the converted wide char back to
 *              the original string.
 */
#include <stdio.h> 
#include <locale.h>
#include <wchar.h>
#include <wctype.h>

static void my_kanato208(wchar_t *);

int
main(int argc, char *argv[])
{       
        wchar_t buf[BUFSIZ];
        
        setlocale(LC_ALL, "");
        
        while (fgetws(buf, BUFSIZ, stdin) != (wchar_t *)NULL) {
                my_kanato208(buf);
                fprintf(stdout, "%ls", buf);
        }
        return (0);
}

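static void
my_kanato208(wchar_t *wcp)
{
        /* Look up the mapping once rather than once per character. */
        wctrans_t to208 = wctrans("tojisx0208");
        size_t wstr_len;
        size_t index;

        /* Leave the string alone if the locale lacks this mapping. */
        if (to208 == (wctrans_t)0)
                return;

        wstr_len = wcslen(wcp);
        for (index = 0; index < wstr_len; index++) {
                wcp[index] = (wchar_t)towctrans((wint_t)wcp[index], to208);
        }
}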
sun% cat file3
新しいシステム*は現在のネットワーク環境を変えることなく
インターネット*とのシームレス*な接続を可能にします。また
セキュリティ*の問題も新しい認証テクノロジー*を用いることで
アドミニストレータ*の負担を減らしています。
sun% cc -o my_kanato208 my_kanato208.c
sun% cat file3 | ./my_kanato208
新しいシステムは現在のネットワーク環境を変えることなく
インターネットとのシームレスな接続を可能にします。また
セキュリティの問題も新しい認証テクノロジーを用いることで
アドミニストレータの負担を減らしています。

Note -

The katakana marked with * is half-width katakana.

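Example 3-2 extends wcstol(). In the Japanese locales that Solaris currently provides, wcstol() cannot be called directly on a numeric string written in the JIS X 0208 character set. The example therefore reads the numeric string as wide-character data, converts it to the corresponding JIS X 0201 characters with towctrans(), and then calls wcstol().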
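
When wcstol() is handed text it cannot interpret as a number, it consumes no characters and leaves the end pointer at the start of the string. The sketch below shows one way to detect that case portably. The helper name parse_long() is hypothetical, and base 10 is an assumption made for this illustration.

```c
#include <errno.h>
#include <wchar.h>

/*
 * Parse a decimal integer from the wide string s.
 * Returns 0 and stores the value in *out on success.
 * Returns -1, leaving *out untouched, if no digits were consumed
 * or the value does not fit in a long.
 */
int
parse_long(const wchar_t *s, long *out)
{
        wchar_t *end;
        long v;

        errno = 0;
        v = wcstol(s, &end, 10);
        if (end == s || errno == ERANGE)
                return (-1);
        *out = v;
        return (0);
}
```

A caller that gets -1 for a string of JIS X 0208 digits can apply the "tojisx0201" mapping to the string first and then try again.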

Example 3-2 Extending `wcstol()`

sun% cat my_wcstol.c
/*
 * Read lines from a file and convert each tokenized
 * wide char string to a long integer.
 * Conversion stops when the input reaches EOF, and
 * the sum of the input integers is printed.
 * Each line is assumed to be at most
 * BUFSIZ - 1 wide characters long.
 *
 * The actual conversion is done by my_wcstol(), which
 * does the following:
 *      1.      Get the length of the wide char string.
 *      2.      Convert each wide char, from the start
 *              of the string, by applying towctrans().
 *              Each JIS X 0208 digit char in the string
 *              is replaced by the corresponding
 *              JIS X 0201 wide char value.
 *              (towctrans() returns its argument unchanged
 *              if there is no corresponding char.)
 *      3.      Write the converted wide char back to
 *              the original string.
 *      4.      Call wcstol() with the converted wide string.
 */
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>
#include <wctype.h>
#include <errno.h>

#define         WRET            L'\n'

static long my_wcstol(wchar_t *, wchar_t **, int);

int
main(int argc, char *argv[])
{
        wchar_t buf[BUFSIZ];
        wchar_t *headp, *nextp;
        long retval, total;

        setlocale(LC_ALL, "");
        total = retval = 0;
        while (fgetws(buf, BUFSIZ, stdin) != (wchar_t *)NULL) {
                headp = buf;
                while (headp != (wchar_t *)NULL) {
                        errno = 0;
                        /* Base 10: the input is treated as decimal. */
                        retval = my_wcstol(headp, &nextp, 10);
                        if (nextp == headp) {
                                /* Nothing converted: end of line or bad input. */
                                if (nextp[0] == WRET || nextp[0] == L'\0')
                                        break;
                                fprintf(stderr, "my_wcstol(): invalid input\n");
                                exit(1);
                        }
                        if (errno != 0) {
                                perror("my_wcstol()");
                                exit(1);
                        }
                        fprintf(stdout, "retval = [%ld]\n", retval);
                        total += retval;
                        headp = nextp;
                }
        }
        fprintf(stdout, "Total = %ld.\n", total);
        return (0);
}

static long
my_wcstol(wchar_t *wcp, wchar_t **endp, int base)
{
        size_t wstr_len;
        size_t index;
        wint_t retval;
        long ret_val;

        wstr_len = wcslen(wcp);
        for (index = 0; index < wstr_len; index++) {
                retval = towctrans((wint_t)wcp[index], wctrans("tojisx0201"));
                wcp[index] = (wchar_t)retval;
        }
        ret_val = wcstol((const wchar_t *)wcp, endp, base);
        return (ret_val);
}
sun% cat file4
343 34534                                       12
３４５３４５                            ３４５３４５
３９８５７      ３９８                          ５８３４５８９
sun% cc -o my_wcstol my_wcstol.c
sun% ./my_wcstol < file4
retval = [343]
retval = [34534]
retval = [12]
retval = [345345]
retval = [345345]
retval = [39857]
retval = [398]
retval = [5834589]
Total = 6600423.

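Japanese Character Classification

Of the APIs used for character classification, Table 3-3 and Table 3-4 list those that are useful for processing the character sets of Japanese locales. Other APIs are also available. For details, see the manual pages, such as iswalpha(3C), wctype(3C), and wctype_ja(3C).

Table 3-3 Character Classification APIs (Part 1)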


Interface specified by XPG	Description
`iswctype(wc, type)`	Tests whether `wc` belongs to the class type.
`wctype("type-name")`	Creates the second argument of `iswctype()` from a type name. XPG specifies the following standard character classes:
	`"alnum"` `"alpha"` `"cntrl"` `"digit"` `"graph"` `"lower"` `"print"` `"punct"` `"space"` `"upper"` `"xdigit"` `"blank"`
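
The pairing of wctype() and iswctype() can be sketched as follows, using only the standard class names from the table. The helper name in_class() and its three-way return value are choices made for this illustration.

```c
#include <wchar.h>
#include <wctype.h>

/*
 * Test whether wc belongs to the character class called name.
 * Returns 1 if it does, 0 if it does not, and -1 if the current
 * locale does not define a class with that name.
 */
int
in_class(wint_t wc, const char *name)
{
        wctype_t cls = wctype(name);

        if (cls == (wctype_t)0)
                return (-1);
        return (iswctype(wc, cls) != 0);
}
```

For example, in_class(wc, "digit") gives the same answer as iswdigit(wc) expressed as 0 or 1.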

Table 3-4 Character Classification APIs (Part 2)


Interface extended in Japanese Solaris	Description
`wctype("type-name")`	Creates the second argument of `iswctype()` from a type name. The character classes that Solaris adds for Japanese locales are:
	`"jkanji"` `"jkata"` `"jhira"` `"jdigit"` `"jparen"` `"jline"` `"jisx0201r"` `"jisx0208"` `"jisx0212"` `"udc"` `"vdc"`
	`"jalpha"` `"jspecial"` `"jgreek"` `"jrussian"` `"junit"` `"jsci"` `"jgen"` `"jpunct"`
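
The Japanese class names are available only in locales that define them, so a lookup can fail elsewhere. The following sketch counts the characters of a wide string that fall into a named class and reports an unknown class separately. count_in_class() is a hypothetical helper name.

```c
#include <wchar.h>
#include <wctype.h>

/*
 * Count the characters of s that belong to the class called name.
 * Returns -1 if the current locale does not define that class.
 */
int
count_in_class(const wchar_t *s, const char *name)
{
        wctype_t cls = wctype(name);
        int n = 0;

        if (cls == (wctype_t)0)
                return (-1);
        for (; *s != L'\0'; s++) {
                if (iswctype((wint_t)*s, cls))
                        n++;
        }
        return (n);
}
```

In a Solaris Japanese locale, a call such as count_in_class(line, "jkanji") would be expected to count the kanji in a line. In a locale without that class the result is -1.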

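Program Example

This section presents an example program that uses the Japanese character classification APIs. As with the character mapping APIs described earlier, include the wchar.h and wctype.h header files, and call setlocale() at the start of processing so that the program runs in the appropriate locale.

Example 3-3 reads its input as wide-character strings, swaps the JIS X 0208 hiragana and katakana in it, converts JIS X 0208 digit characters to ASCII digits, and writes the result.

Example 3-3 Character Classification API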

sun% cat my_charconv.c
/*
 * Read lines from a file and convert JIS X 0208 hiragana
 * characters to JIS X 0208 katakana characters, and
 * vice versa. In addition, JIS X 0208 digit characters
 * are converted to the corresponding JIS X 0201
 * characters.
 * Conversion stops when the input reaches EOF.
 * Each line is assumed to be at most BUFSIZ - 1
 * wide characters long.
 *
 * The actual conversion is done by my_charconv(), which does
 * the following:
 *	1.	Get the length of the wide string.
 *	2.	Classify each wide char, from the start of
 *		the string, with iswctype() and convert it
 *		by applying the matching towctrans() mapping.
 *		(towctrans() returns its argument unchanged
 *		if there is no corresponding char.)
 *	3.	Write the converted wide char back to the
 *		original string; main() then outputs it.
 */

#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>
#include <wctype.h>
#include <jctype.h>
#include <errno.h>

static int my_charconv(wchar_t *);

int
main(int argc, char *argv[])
{
	wchar_t buf[BUFSIZ];
	int retval;

	setlocale(LC_ALL, "");

	while (fgetws(buf, BUFSIZ, stdin) != (wchar_t *)NULL) {
		retval = my_charconv(buf);
		if (retval == -1) {
			perror("my_charconv()");
			exit(-1);
		}
		fprintf(stdout, "%ls", buf);
	}

	return (0);
}

static int
my_charconv(wchar_t *wcp)
{
	size_t wstr_len;
	size_t index;
	wint_t retval;

	wstr_len = wcslen(wcp);
	for (index = 0; index < wstr_len; index++) {
		errno = 0;
		if (iswctype((wint_t)wcp[index], wctype("jhira")))
			retval = towctrans((wint_t)wcp[index], wctrans("tojkata"));
		else if (iswctype((wint_t)wcp[index], wctype("jkata")))
			retval = towctrans((wint_t)wcp[index], wctrans("tojhira"));
		else if (iswctype((wint_t)wcp[index], wctype("jdigit")))
			retval = towctrans((wint_t)wcp[index], wctrans("tojisx0201"));
		else
			retval = wcp[index];

		if (errno != 0)
			return (-1);
		wcp[index] = (wchar_t)retval;
	}

	return (0);
}
sun% cat file5
ひらがなはかたかなに置換されます。
カタカナハヒラガナニ置換サレマス。
漢字、記号、全角ａｌｐｈａｂｅｔや
JIS X 0201 カナナドハ* 置換 サレマセン*。
sun% cc -o my_charconv my_charconv.c
sun% ./my_charconv < file5
ヒラガナハカタカナニ置換サレマス。
かたかなはひらがなに置換されます。
漢字、記号、全角ａｌｐｈａｂｅｔヤ
JIS X 0201 カナナドハ* 置換 サレマセン*。

Note -

The katakana marked with * is half-width katakana.
