libc: Locales

 

1 Locales

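Internationalizing software means making it follow the conventions its users expect. In ISO C this is done through locales.

A machine can support several locales, and users choose the locale a program will use through environment variables.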

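1.1 What Locales Do

Each locale is made up of a set of conventions, each defined for a different purpose. These conventions cover:

  • Which multibyte character sequences are valid, and how they are interpreted.
  • How characters are classified.
  • The collating sequence for the local language and character set.
  • How numbers are formatted for display.
  • Which language is used for output, including error messages.
  • Which language is used for answers to yes-or-no questions.
  • Which language is used for more complex user input.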

1.2 Choosing a Locale

The simplest way to choose (set) a locale is to set the environment variable LANG, which selects that locale for every category. For example:

[yyc@localhost ~]$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

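Individual categories of a locale can also be set on their own. For example, early versions of fcitx (a Chinese input method for Linux) required LC_CTYPE to use GB2312, which can be arranged as follows: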

[yyc@localhost ~]$ export LC_CTYPE="zh_CN.GB2312"
[yyc@localhost ~]$ locale
LANG=en_US.UTF-8
LC_CTYPE=zh_CN.GB2312
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

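A system need not support every locale, but every system must support one standard locale, named "C" or "POSIX".

1.3 Categories of Activities that Locales Affect

The conventions a locale defines fall into the categories below. Each category name can be used both as the name of an environment variable and as a macro passed as an argument to setlocale().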

  • LC_COLLATE

    Affects the collation (ordering) of strings.

  • LC_CTYPE

    Affects the classification of characters and the conversion between multibyte and wide characters.

  • LC_MONETARY

    Affects the formatting of monetary values.

  • LC_NUMERIC

    Affects the formatting of numeric values that are not monetary.

  • LC_TIME

    Affects the formatting of date and time values.

  • LC_MESSAGES

    Affects the language used for messages in the user interface and the regular expressions used to match answers to yes-or-no questions.

  • LC_ALL

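    This is not a category of its own. As a macro passed to setlocale(), it sets all of the categories above at once. As an environment variable, it overrides LANG and every other LC_* variable.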

  • LANG

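    If this environment variable is set, its value applies to all of the categories above, except those the user has explicitly overridden through the corresponding LC_* variable.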

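1.4 Setting the Locale

A C program inherits the locale environment variables automatically when it starts. These variables do not, however, automatically control the locale used by the library functions: ISO C specifies that every program starts in the standard "C" locale, so the library functions keep using it until the program asks otherwise.

Calling setlocale() tells the library functions to use the locale selected by the environment variables: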

setlocale(LC_ALL, "");

setlocale() can also set a single category of the locale:

char * setlocale (int CATEGORY, const char *LOCALE);

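This function sets category CATEGORY of the current locale to LOCALE.

  • If LOCALE is NULL, the function changes nothing and returns the name of the locale currently in use for CATEGORY;
  • If LOCALE is not NULL and names a valid locale, the function returns the name of the locale in effect after the change;
  • If LOCALE is not NULL but is not valid, the current locale is left unchanged and the function returns NULL.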

1.5 Standard Locales

As mentioned above, not every system supports every locale, but every system must support a few standard ones. These are:

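  • C:
    The locale specified by standard C; its attributes and behavior follow the ISO C standard.
  • POSIX:
    The POSIX locale; in glibc on Linux it is currently an alias for the standard C locale.
  • ""
    The empty name; a program that passes it selects the locale specified by the environment variables.

Locales are usually defined and installed by the system administrator.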

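1.6 Getting Locale Information

There are several ways to obtain locale information. The simplest is to let the C library do it: many library functions consult the locale on their own. strftime() is one example: the same code produces different output under different locales.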

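Sometimes, though, a program cannot rely on that and has to look the information up itself. Two functions serve this purpose: localeconv() and nl_langinfo(). The former is part of standard C and therefore portable, but its interface is poor. The latter is a Unix interface, available on any system that conforms to the Unix standard.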

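1.6.1 The Clumsy localeconv

Like setlocale(), localeconv() comes from standard C and is portable, but it is expensive to use and hard to extend. It also exposes only the LC_MONETARY and LC_NUMERIC categories, so it is not general-purpose.

The prototype of localeconv() is: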

struct lconv * localeconv (void);

It returns a pointer to a struct lconv whose members describe how to format numbers and monetary amounts in the current locale. glibc defines the structure as follows:

/* Structure giving information about numeric and monetary notation.  */
struct lconv
{
  /* Numeric (non-monetary) information.  */

  char *decimal_point;      /* Decimal point character.  */
  char *thousands_sep;      /* Thousands separator.  */
  /* Each element is the number of digits in each group;
     elements with higher indices are farther left.
     An element with value CHAR_MAX means that no further grouping is done.
     An element with value 0 means that the previous element is used
     for all groups farther left.  */
  char *grouping;

  /* Monetary information.  */

  /* First three chars are a currency symbol from ISO 4217.
     Fourth char is the separator.  Fifth char is '\0'.  */
  char *int_curr_symbol;
  char *currency_symbol;    /* Local currency symbol.  */
  char *mon_decimal_point;  /* Decimal point character.  */
  char *mon_thousands_sep;  /* Thousands separator.  */
  char *mon_grouping;       /* Like `grouping' element (above).  */
  char *positive_sign;      /* Sign for positive values.  */
  char *negative_sign;      /* Sign for negative values.  */
  char int_frac_digits;     /* Int'l fractional digits.  */
  char frac_digits;     /* Local fractional digits.  */
  /* 1 if currency_symbol precedes a positive value, 0 if succeeds.  */
  char p_cs_precedes;
  /* 1 iff a space separates currency_symbol from a positive value.  */
  char p_sep_by_space;
  /* 1 if currency_symbol precedes a negative value, 0 if succeeds.  */
  char n_cs_precedes;
  /* 1 iff a space separates currency_symbol from a negative value.  */
  char n_sep_by_space;
  /* Positive and negative sign positions:
     0 Parentheses surround the quantity and currency_symbol.
     1 The sign string precedes the quantity and currency_symbol.
     2 The sign string follows the quantity and currency_symbol.
     3 The sign string immediately precedes the currency_symbol.
     4 The sign string immediately follows the currency_symbol.  */
  char p_sign_posn;
  char n_sign_posn;
#ifdef __USE_ISOC99
  /* 1 if int_curr_symbol precedes a positive value, 0 if succeeds.  */
  char int_p_cs_precedes;
  /* 1 iff a space separates int_curr_symbol from a positive value.  */
  char int_p_sep_by_space;
  /* 1 if int_curr_symbol precedes a negative value, 0 if succeeds.  */
  char int_n_cs_precedes;
  /* 1 iff a space separates int_curr_symbol from a negative value.  */
  char int_n_sep_by_space;
  /* Positive and negative sign positions:
     0 Parentheses surround the quantity and int_curr_symbol.
     1 The sign string precedes the quantity and int_curr_symbol.
     2 The sign string follows the quantity and int_curr_symbol.
     3 The sign string immediately precedes the int_curr_symbol.
     4 The sign string immediately follows the int_curr_symbol.  */
  char int_p_sign_posn;
  char int_n_sign_posn;
#else
  char __int_p_cs_precedes;
  char __int_p_sep_by_space;
  char __int_n_cs_precedes;
  char __int_n_sep_by_space;
  char __int_p_sign_posn;
  char __int_n_sign_posn;
#endif
};

See the comments in the structure for the meaning of each member.

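1.6.2 The Elegant, Fast nl_langinfo

char *nl_langinfo(nl_item ITEM);

nl_langinfo() gives access to individual items of locale information; it is fine-grained and fast. The possible values of ITEM are defined in the header langinfo.h and are described below: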

`CODESET'
      `nl_langinfo' returns a string with the name of the coded
      character set used in the selected locale.

`ABDAY_1'
`ABDAY_2'
`ABDAY_3'
`ABDAY_4'
`ABDAY_5'
`ABDAY_6'
`ABDAY_7'
      `nl_langinfo' returns the abbreviated weekday name.  `ABDAY_1'
      corresponds to Sunday.

`DAY_1'
`DAY_2'
`DAY_3'
`DAY_4'
`DAY_5'
`DAY_6'
`DAY_7'
      Similar to `ABDAY_1' etc., but here the return value is the
      unabbreviated weekday name.

`ABMON_1'
`ABMON_2'
`ABMON_3'
`ABMON_4'
`ABMON_5'
`ABMON_6'
`ABMON_7'
`ABMON_8'
`ABMON_9'
`ABMON_10'
`ABMON_11'
`ABMON_12'
      The return value is abbreviated name of the month.  `ABMON_1'
      corresponds to January.

`MON_1'
`MON_2'
`MON_3'
`MON_4'
`MON_5'
`MON_6'
`MON_7'
`MON_8'
`MON_9'
`MON_10'
`MON_11'
`MON_12'
      Similar to `ABMON_1' etc., but here the month names are not
      abbreviated.  Here the first value `MON_1' also corresponds
      to January.

`AM_STR'
`PM_STR'
      The return values are strings which can be used in the
      representation of time as an hour from 1 to 12 plus an am/pm
      specifier.

      Note that in locales which do not use this time representation
      these strings might be empty, in which case the am/pm format
      cannot be used at all.

`D_T_FMT'
      The return value can be used as a format string for
      `strftime' to represent time and date in a locale-specific
      way.

`D_FMT'
      The return value can be used as a format string for
      `strftime' to represent a date in a locale-specific way.

`T_FMT'
      The return value can be used as a format string for
      `strftime' to represent time in a locale-specific way.

`T_FMT_AMPM'
      The return value can be used as a format string for
      `strftime' to represent time in the am/pm format.

      Note that if the am/pm format does not make any sense for the
      selected locale, the return value might be the same as the
      one for `T_FMT'.

`ERA'
      The return value represents the era used in the current
      locale.

      Most locales do not define this value.  An example of a
      locale which does define this value is the Japanese one.  In
      Japan, the traditional representation of dates includes the
      name of the era corresponding to the then-emperor's reign.

      Normally it should not be necessary to use this value
      directly.  Specifying the `E' modifier in their format
      strings causes the `strftime' functions to use this
      information.  The format of the returned string is not
      specified, and therefore you should not assume knowledge of
      it on different systems.

`ERA_YEAR'
      The return value gives the year in the relevant era of the
      locale.  As for `ERA' it should not be necessary to use this
      value directly.

`ERA_D_T_FMT'
      This return value can be used as a format string for
      `strftime' to represent dates and times in a locale-specific
      era-based way.

`ERA_D_FMT'
      This return value can be used as a format string for
      `strftime' to represent a date in a locale-specific era-based
      way.

`ERA_T_FMT'
      This return value can be used as a format string for
      `strftime' to represent time in a locale-specific era-based
      way.

`ALT_DIGITS'
      The return value is a representation of up to 100 values used
      to represent the values 0 to 99.  As for `ERA' this value is
      not intended to be used directly, but instead indirectly
      through the `strftime' function.  When the modifier `O' is
      used in a format which would otherwise use numerals to
      represent hours, minutes, seconds, weekdays, months, or
      weeks, the appropriate value for the locale is used instead.

`INT_CURR_SYMBOL'
      The same as the value returned by `localeconv' in the
      `int_curr_symbol' element of the `struct lconv'.

`CURRENCY_SYMBOL'
`CRNCYSTR'
      The same as the value returned by `localeconv' in the
      `currency_symbol' element of the `struct lconv'.

      `CRNCYSTR' is a deprecated alias still required by Unix98.

`MON_DECIMAL_POINT'
      The same as the value returned by `localeconv' in the
      `mon_decimal_point' element of the `struct lconv'.

`MON_THOUSANDS_SEP'
      The same as the value returned by `localeconv' in the
      `mon_thousands_sep' element of the `struct lconv'.

`MON_GROUPING'
      The same as the value returned by `localeconv' in the
      `mon_grouping' element of the `struct lconv'.

`POSITIVE_SIGN'
      The same as the value returned by `localeconv' in the
      `positive_sign' element of the `struct lconv'.

`NEGATIVE_SIGN'
      The same as the value returned by `localeconv' in the
      `negative_sign' element of the `struct lconv'.

`INT_FRAC_DIGITS'
      The same as the value returned by `localeconv' in the
      `int_frac_digits' element of the `struct lconv'.

`FRAC_DIGITS'
      The same as the value returned by `localeconv' in the
      `frac_digits' element of the `struct lconv'.

`P_CS_PRECEDES'
      The same as the value returned by `localeconv' in the
      `p_cs_precedes' element of the `struct lconv'.

`P_SEP_BY_SPACE'
      The same as the value returned by `localeconv' in the
      `p_sep_by_space' element of the `struct lconv'.

`N_CS_PRECEDES'
      The same as the value returned by `localeconv' in the
      `n_cs_precedes' element of the `struct lconv'.

`N_SEP_BY_SPACE'
      The same as the value returned by `localeconv' in the
      `n_sep_by_space' element of the `struct lconv'.

`P_SIGN_POSN'
      The same as the value returned by `localeconv' in the
      `p_sign_posn' element of the `struct lconv'.

`N_SIGN_POSN'
      The same as the value returned by `localeconv' in the
      `n_sign_posn' element of the `struct lconv'.

`INT_P_CS_PRECEDES'
      The same as the value returned by `localeconv' in the
      `int_p_cs_precedes' element of the `struct lconv'.

`INT_P_SEP_BY_SPACE'
      The same as the value returned by `localeconv' in the
      `int_p_sep_by_space' element of the `struct lconv'.

`INT_N_CS_PRECEDES'
      The same as the value returned by `localeconv' in the
      `int_n_cs_precedes' element of the `struct lconv'.

`INT_N_SEP_BY_SPACE'
      The same as the value returned by `localeconv' in the
      `int_n_sep_by_space' element of the `struct lconv'.

`INT_P_SIGN_POSN'
      The same as the value returned by `localeconv' in the
      `int_p_sign_posn' element of the `struct lconv'.

`INT_N_SIGN_POSN'
      The same as the value returned by `localeconv' in the
      `int_n_sign_posn' element of the `struct lconv'.

`DECIMAL_POINT'
`RADIXCHAR'
      The same as the value returned by `localeconv' in the
      `decimal_point' element of the `struct lconv'.

      The name `RADIXCHAR' is a deprecated alias still used in
      Unix98.

`THOUSANDS_SEP'
`THOUSEP'
      The same as the value returned by `localeconv' in the
      `thousands_sep' element of the `struct lconv'.

      The name `THOUSEP' is a deprecated alias still used in Unix98.

`GROUPING'
      The same as the value returned by `localeconv' in the
      `grouping' element of the `struct lconv'.

`YESEXPR'
      The return value is a regular expression which can be used
      with the `regex' function to recognize a positive response to
      a yes/no question.  The GNU C library provides the `rpmatch'
      function for easier handling in applications.

`NOEXPR'
      The return value is a regular expression which can be used
      with the `regex' function to recognize a negative response to
      a yes/no question.

`YESSTR'
      The return value is a locale-specific translation of the
      positive response to a yes/no question.

      Using this value is deprecated since it is a very special
      case of message translation, and is better handled by the
      message translation functions (*note Message Translation::).

      The use of this symbol is deprecated.  Instead message
      translation should be used.

`NOSTR'
      The return value is a locale-specific translation of the
      negative response to a yes/no question.  What is said for
      `YESSTR' is also true here.

      The use of this symbol is deprecated.  Instead message
      translation should be used.

posted @ 2011-11-03 11:15  英超