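Text Localization in Software
I. The GNU solution
According to material available online, GNU's localization support is built on the gettext suite. The xgettext tool scans source code for strings and generates a PO (Portable Object) file. At run time, the program calls specific functions to look up the localized version of each string.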
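As a rough illustration of the run-time half, a minimal sketch of a C program using these functions might look like the following. The domain name "tsecer-demo" and the directory are placeholders chosen for this example, not something taken from a real package.

```c
#include <libintl.h>
#include <locale.h>

/* Hypothetical domain name and catalog directory, used only in this sketch. */
#define DEMO_DOMAIN "tsecer-demo"
#define DEMO_LOCALEDIR "/usr/share/locale"

/* Pick the locale from the environment and bind the message domain. */
void demo_init_i18n(void)
{
    setlocale(LC_ALL, "");
    bindtextdomain(DEMO_DOMAIN, DEMO_LOCALEDIR);
    textdomain(DEMO_DOMAIN);
}

/* Return the translation of a fixed msgid, or the msgid itself when no
   catalog for the current locale can be found. */
const char *demo_greeting(void)
{
    return gettext("Hello, world");
}
```

Since no catalog exists for the placeholder domain, demo_greeting() is expected to hand back the original English string.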
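II. How xgettext is implemented
1. keyword
Judging from the code, xgettext's implementation is fairly simple; its distinguishing feature is support for many different file types, such as Rust, XML and PHP. A string-scanning tool may not look complicated, but xgettext actually applies finer syntactic control: rather than scanning for strings blindly, it targets specific functions, which the xgettext source calls keywords.
In the init_keywords function, xgettext registers the functions it needs to watch for when processing C: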
//gettext-tools\src\x-c.c
/* Finish initializing the keywords hash tables.
Called after argument processing, before each file is processed. */
static void
init_keywords ()
{
if (default_keywords)
{
/* When adding new keywords here, also update the documentation in
xgettext.texi! */
x_c_keyword ("gettext");
x_c_keyword ("dgettext:2");
x_c_keyword ("dcgettext:2");
x_c_keyword ("ngettext:1,2");
x_c_keyword ("dngettext:2,3");
//……
}
//……
}
From the man pages, the prototypes of these functions are:
char * gettext (const char * msgid);
char * dgettext (const char * domainname, const char * msgid);
char * dcgettext (const char * domainname, const char * msgid,
int category);
char * ngettext (const char * msgid, const char * msgid_plural,
unsigned long int n);
char * dngettext (const char * domainname,
const char * msgid, const char * msgid_plural,
unsigned long int n);
char * dcngettext (const char * domainname,
const char * msgid, const char * msgid_plural,
unsigned long int n, int category);
Compare this with the structure xgettext uses to hold the result of parsing a keyword:
/* Calling convention for a given keyword. */
struct callshape
{
int argnum1; /* argument number to use for msgid */
int argnum2; /* argument number to use for msgid_plural */
int argnumc; /* argument number to use for msgctxt */
bool argnum1_glib_context; /* argument argnum1 has the syntax "ctxt|msgid" */
bool argnum2_glib_context; /* argument argnum2 has the syntax "ctxt|msgid" */
int argtotal; /* total number of arguments */
string_list_ty xcomments; /* auto-extracted comments */
};
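The comments in this structure already make clear what a keyword description means:
argnum1 is the number of the argument that holds msgid, and argnum2 is the number of the argument that holds msgid_plural. The latter exists because singular and plural forms can differ between languages.
The remaining optional fields are:
- argnumc: the number of the argument used for msgctxt (a position, not a count), written in a keyword description with the suffix "c" (context);
- argnum1_glib_context: whether argument argnum1 uses the "ctxt|msgid" syntax, written with the suffix "g" (glib);
- argnum2_glib_context: whether argument argnum2 uses the "ctxt|msgid" syntax, also written with the suffix "g" (glib);
- argtotal: the total number of arguments, written with the suffix "t" (total);
- xcomments: comments to extract automatically, written as a double-quoted string.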
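In a keyword description:
- with no numeric argument, the first argument is msgid;
- with a single number, that number is the position of msgid;
- with two numbers separated by a comma (,) after the colon (:), they are the positions of msgid and msgid_plural respectively.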
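To make the description format concrete, here is a small sketch of a parser for such specs. It is not xgettext's own code: struct demo_callshape and demo_parse_keyword are names invented for this example, and the "g" suffix and quoted comments are left out to keep it short.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for xgettext's struct callshape (illustration only). */
struct demo_callshape {
    char name[64];   /* keyword (function) name */
    int argnum1;     /* argument number of msgid */
    int argnum2;     /* argument number of msgid_plural, 0 if none */
    int argnumc;     /* argument number of msgctxt, 0 if none */
    int argtotal;    /* required total number of arguments, 0 if any */
};

/* Parse specs such as "gettext", "dgettext:2", "dngettext:2,3" or
   "pgettext:1c,2".  Returns true on success. */
bool demo_parse_keyword(const char *spec, struct demo_callshape *out)
{
    const char *colon = strchr(spec, ':');
    size_t len = colon ? (size_t)(colon - spec) : strlen(spec);
    int plain = 0;   /* how many unsuffixed numbers were seen so far */

    if (len == 0 || len >= sizeof out->name)
        return false;
    memset(out, 0, sizeof *out);
    memcpy(out->name, spec, len);

    const char *p = colon ? colon + 1 : "";
    while (*p != '\0') {
        char *end;
        long n = strtol(p, &end, 10);
        if (end == p || n <= 0)
            return false;
        if (*end == 'c') { out->argnumc = (int)n; end++; }
        else if (*end == 't') { out->argtotal = (int)n; end++; }
        else if (plain == 0) { out->argnum1 = (int)n; plain++; }
        else if (plain == 1) { out->argnum2 = (int)n; plain++; }
        else return false;
        if (*end == ',')
            end++;
        else if (*end != '\0')
            return false;
        p = end;
    }
    if (out->argnum1 == 0)
        out->argnum1 = 1;   /* no number given: msgid is the first argument */
    return true;
}
```

With this sketch, "dngettext:2,3" yields argnum1 = 2 and argnum2 = 3, while a bare "gettext" falls back to argnum1 = 1.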
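2. Scanning
With the keyword definitions in place, scanning the source has clear rules to follow and is fairly straightforward: look for calls of a specific form (a function name followed by a left parenthesis), then use the commas (,) to count arguments and record the msgid argument.
This process implies a basic conclusion:
Not every string in the code is extracted; only strings passed at the positions a keyword specifies are (the -a option extracts all string literals).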
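The scanning idea can be sketched with a toy function. This is far simpler than the real scanner in x-c.c: demo_extract_msgid is a made-up name, and comments, character literals, adjacent-literal concatenation and escape decoding are all ignored here.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy scanner: find a call `keyword(...)` in src and copy the first string
   literal found in argument number argnum (1-based) into out. */
bool demo_extract_msgid(const char *src, const char *keyword, int argnum,
                        char *out, size_t outsz)
{
    size_t klen = strlen(keyword);

    if (klen == 0 || outsz == 0 || argnum < 1)
        return false;
    for (const char *p = src; (p = strstr(p, keyword)) != NULL; p += klen) {
        /* Require a whole identifier, not the tail of e.g. "dngettext". */
        if (p > src && (isalnum((unsigned char)p[-1]) || p[-1] == '_'))
            continue;
        const char *q = p + klen;
        while (isspace((unsigned char)*q))
            q++;
        if (*q != '(')
            continue;

        int arg = 1, depth = 0;
        for (q++; *q != '\0'; q++) {
            if (*q == '"') {
                bool wanted = (depth == 0 && arg == argnum);
                size_t n = 0;
                for (q++; *q != '\0' && *q != '"'; q++) {
                    if (*q == '\\' && q[1] != '\0') {
                        /* Keep the backslash and step to the escaped char. */
                        if (wanted && n + 1 < outsz)
                            out[n++] = *q;
                        q++;
                    }
                    if (wanted && n + 1 < outsz)
                        out[n++] = *q;
                }
                if (*q == '\0')
                    break;              /* unterminated literal */
                if (wanted) {
                    out[n] = '\0';
                    return true;
                }
            } else if (*q == '(') {
                depth++;
            } else if (*q == ')') {
                if (depth == 0)
                    break;              /* end of this call */
                depth--;
            } else if (*q == ',' && depth == 0) {
                arg++;                  /* next top-level argument */
            }
        }
    }
    return false;
}
```

A string that is not an argument of the requested keyword is never returned, which is the same reason plain literals are skipped unless -a is given.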
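3. An example
As shown below, without the -a option the output does not include plain string literals. The commonly used _ is not a built-in keyword, so it has to be declared with the -k option before xgettext recognizes it.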
tsecer@harry: cat -n input.c
1 int main(int argc, const char *argv[])
2 {
3 const char *pstr = "in literal string";
4 gettext("in gettext" __FILE__);
5 _("in underscore");
6 return 0;
7 }
tsecer@harry: xgettext input.c
tsecer@harry: cat messages.po
# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR THE PACKAGE'S COPYRIGHT HOLDER
# This file is distributed under the same license as the PACKAGE package.
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2022-06-22 17:14+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"Language: \n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=CHARSET\n"
"Content-Transfer-Encoding: 8bit\n"
#: input.c:4
msgid "in gettext"
msgstr ""
tsecer@harry: xgettext input.c -k_
tsecer@harry: cat messages.po
# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR THE PACKAGE'S COPYRIGHT HOLDER
# This file is distributed under the same license as the PACKAGE package.
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2022-06-22 17:15+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"Language: \n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=CHARSET\n"
"Content-Transfer-Encoding: 8bit\n"
#: input.c:4
msgid "in gettext"
msgstr ""
#: input.c:5
msgid "in underscore"
msgstr ""
tsecer@harry: xgettext input.c -a
tsecer@harry: cat messages.po
# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR THE PACKAGE'S COPYRIGHT HOLDER
# This file is distributed under the same license as the PACKAGE package.
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2022-06-22 17:15+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"Language: \n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=CHARSET\n"
"Content-Transfer-Encoding: 8bit\n"
#: input.c:3
msgid "in literal string"
msgstr ""
#: input.c:4
msgid "in gettext"
msgstr ""
#: input.c:5
msgid "in underscore"
msgstr ""
tsecer@harry:
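The _ used in input.c is normally a project-level macro rather than a library function. A common convention, sketched here under the assumption that the project defines the macros itself, pairs it with N_ for strings that can only be marked, not translated, at the point of definition (N_ would then need its own -kN_ option during extraction).

```c
#include <libintl.h>
#include <stddef.h>

/* Shorthand macros; these are project conventions, not part of libintl. */
#define _(msgid) gettext(msgid)
#define N_(msgid) (msgid)   /* mark for extraction only, translate later */

/* Static initializers cannot call gettext(), so the strings are only marked. */
static const char *const demo_weekdays[] = { N_("Monday"), N_("Tuesday") };

/* Translate a marked string at the point of use. */
const char *demo_weekday_name(size_t i)
{
    return _(demo_weekdays[i]);
}
```

When no catalog is loaded, demo_weekday_name(0) simply returns the untranslated "Monday".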
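III. How gcc uses it
1. gcc's preprocessing
Looking at gcc's multilingual support, many of its catalog entries do not come from gettext calls. The string below, for example, appears as an argument to an error-reporting function.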
//gcc\po\gcc.pot
#: cfgrtl.c:2661
msgid "flow control insn inside a basic block"
msgstr ""
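Looking at the gcc\po\exgettext script shows that it does extra processing. It mainly searches for #define macro definitions whose parameter list contains the name "msgid", works out the position of that parameter, and then passes the position and the format flag to xgettext through these two options:
-kWORD, --keyword=WORD look for WORD as an additional keyword
--flag=WORD:ARG:FLAG additional flag for strings inside the argument number ARG of keyword WORD
The -k option uses the same description format as the keywords in the code, so it can also specify the argument position.
The msgid shown earlier, for example, is picked up through the macro below; its parameter list does contain the name msgid.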
rtl.h
#define fatal_insn(msgid, insn) \
_fatal_insn (msgid, insn, __FILE__, __LINE__, __FUNCTION__)
2. The relevant code in exgettext
The excerpt contains the matching on the # define macro form and on the name msgid, as well as the generation of the --keyword= option.
# ....
if (n == 1) { keyword = "--keyword=" name }
else {
keyword = "--keyword=" name ":" n
if (name ~ /_n$/)
keyword = keyword "," (n + 1)
}
if (format) {
keyword=keyword "\n--flag=" name ":" n ":" format
if (name ~ /_n$/)
keyword = keyword "\n--flag=" name ":" (n + 1) ":" format
}
# ....
END {
for (f in files) {
file = files[f]
lineno = 1
while (getline < file) {
if (/^(#[ ]*define[ ]*)?[A-Za-z_].*\(.*msgid[,\)]/) {
keyword_option($0)
} else if (/^(#[ ]*define[ ]*)?[A-Za-z_].*(\(|\(.*,)$/) {
name_line = $0
while (getline < file) {
lineno++
if (/msgid[,\)]/){
keyword_option(name_line $0)
break
} else if (/,$/) {
name_line = name_line $0
continue
} else break
}
} else if (/%e/ || /%n/) {
spec_error_string($0)
}
lineno++
}
}
print emsg > posr
}'
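For readers less used to awk, the core of that logic can be approximated in C. This is only a sketch that mirrors the part of the script shown above: demo_keyword_option is an invented name, it handles one-line declarations only, and it ignores the --flag output.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Given a one-line declaration or macro head such as
   "#define fatal_insn(msgid, insn)", write the matching --keyword option.
   Returns false when no name or no msgid parameter is found. */
bool demo_keyword_option(const char *decl, char *out, size_t outsz)
{
    const char *open = strchr(decl, '(');
    if (open == NULL)
        return false;

    /* The name is the identifier that ends right before '('. */
    const char *end = open;
    while (end > decl && end[-1] == ' ')
        end--;
    const char *start = end;
    while (start > decl &&
           (isalnum((unsigned char)start[-1]) || start[-1] == '_'))
        start--;
    if (start == end)
        return false;

    /* Commas in front of the msgid parameter give its position. */
    const char *msgid = strstr(open, "msgid");
    if (msgid == NULL)
        return false;
    int n = 1;
    for (const char *p = open; p < msgid; p++)
        if (*p == ',')
            n++;

    int namelen = (int)(end - start);
    bool plural = namelen >= 2 && strncmp(end - 2, "_n", 2) == 0;
    if (n == 1)
        snprintf(out, outsz, "--keyword=%.*s", namelen, start);
    else if (plural)   /* names ending in _n also get the plural position */
        snprintf(out, outsz, "--keyword=%.*s:%d,%d", namelen, start, n, n + 1);
    else
        snprintf(out, outsz, "--keyword=%.*s:%d", namelen, start, n);
    return true;
}
```

For the fatal_insn macro from rtl.h this produces --keyword=fatal_insn, because msgid is the first parameter and no position suffix is needed.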
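IV. How gettext is implemented
1. Fetching the translated text
As the code shows, the catalog file name is the domain plus a .mo suffix, inside a specific directory, prefixed with the category mentioned earlier (LC_MESSAGES, LC_TIME and so on).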
//glibc-2.17\intl\dcigettext.c
char *
DCIGETTEXT (domainname, msgid1, msgid2, plural, n, category)
const char *domainname;
const char *msgid1;
const char *msgid2;
int plural;
unsigned long int n;
int category;
{
//……
stpcpy (mempcpy (stpcpy (stpcpy (xdomainname, categoryname), "/"),
domainname, domainname_len),
".mo");
//……
}
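Put together with the directory and the locale name, the resulting file path can be sketched as below. The excerpt above only shows the category/domain.mo part; the leading directory and locale components are an assumption based on the gcc.mo path shown later in this article, and demo_catalog_path is a name made up for the example.

```c
#include <stdio.h>

/* Compose dirname/locale/category/domain.mo into out.
   Returns the length snprintf would have written (negative on error). */
int demo_catalog_path(char *out, size_t outsz, const char *dirname,
                      const char *locale, const char *category,
                      const char *domain)
{
    return snprintf(out, outsz, "%s/%s/%s/%s.mo",
                    dirname, locale, category, domain);
}
```

Calling it with "/usr/share/locale", "zh_CN", "LC_MESSAGES" and "gcc" gives /usr/share/locale/zh_CN/LC_MESSAGES/gcc.mo.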
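2. Where the domain comes from
The C library provides textdomain to select the domain (and therefore the mo file name) and bindtextdomain to set the directory searched for it. gcc, for example, sets its domain to gcc during initialization, so its translations are loaded from gcc.mo. LOCALEDIR is the installation location fixed when gcc is built, which also determines where the local gcc.mo file is installed.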
void
gcc_init_libintl (void)
{
#ifdef HAVE_LC_MESSAGES
setlocale (LC_CTYPE, "");
setlocale (LC_MESSAGES, "");
#else
setlocale (LC_ALL, "");
#endif
(void) bindtextdomain ("gcc", LOCALEDIR);
(void) textdomain ("gcc");
//……
}
3. Where the locale comes from
The code above also calls setlocale before textdomain. The manual for that function says:
If locale is "", each part of the locale that should be modified is set according to the environment variables. The details are implementation-dependent. For glibc, first (regardless of category), the environment variable LC_ALL is inspected, next the environment variable with the same name as the category (LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, LC_TIME) and finally the environment variable LANG. The first existing environment variable is used. If its value is not a valid locale specification, the locale is unchanged, and setlocale() returns NULL.
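When the locale string is empty, the locale is therefore taken from the environment variables rather than from the argument.
The glibc code shows the order: LC_ALL has the highest priority, then the variable named after the category, for example "LC_MESSAGES", and finally the "LANG" environment variable; if none of them is set, the C locale is used.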
//findlocale.c
struct __locale_data *
internal_function
_nl_find_locale (const char *locale_path, size_t locale_path_len,
int category, const char **name)
{
//……
if ((*name)[0] == '\0')
{
/* The user decides which locale to use by setting environment
variables. */
*name = getenv ("LC_ALL");
if (*name == NULL || (*name)[0] == '\0')
*name = getenv (_nl_category_names.str
+ _nl_category_name_idxs[category]);
if (*name == NULL || (*name)[0] == '\0')
*name = getenv ("LANG");
}
if (*name == NULL || (*name)[0] == '\0'
|| (__builtin_expect (__libc_enable_secure, 0)
&& strchr (*name, '/') != NULL))
*name = (char *) _nl_C_name;
//……
}
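The lookup order in this excerpt can be restated as a small standalone function. This is only a sketch: demo_locale_from_env is a hypothetical helper, and it leaves out the secure-mode check on '/' that the glibc code performs.

```c
#include <stdlib.h>

/* Return the first non-empty value among LC_ALL, the category variable
   (for example "LC_MESSAGES") and LANG, falling back to "C". */
const char *demo_locale_from_env(const char *category_var)
{
    const char *vars[] = { "LC_ALL", category_var, "LANG" };

    for (size_t i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *value = getenv(vars[i]);
        if (value != NULL && value[0] != '\0')
            return value;
    }
    return "C";
}
```

Note that an empty variable is treated the same as an unset one, just as in the glibc code.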
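However, dcgettext obtains the locale through guess_category_value, where "LANGUAGE" has an even higher priority (unless the current locale is the C locale). This matches the priority order described here: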
The locale environment variables tell the OS how to display or output certain kinds of text. They’re prioritized, allowing us to influence which one(s) will come into play in various scenarios:
LANGUAGE
LC_ALL
LC_xxx, while taking into account the locale category
LANG
/* Guess value of current locale from value of the environment variables. */
static const char *
internal_function
guess_category_value (category, categoryname)
int category;
const char *categoryname;
{
const char *language;
const char *retval;
/* The highest priority value is the `LANGUAGE' environment
variable. But we don't use the value if the currently selected
locale is the C locale. This is a GNU extension. */
language = getenv ("LANGUAGE");
if (language != NULL && language[0] == '\0')
language = NULL;
/* We have to proceed with the POSIX methods of looking to `LC_ALL',
`LC_xxx', and `LANG'. On some systems this can be done by the
`setlocale' function itself. */
#ifdef _LIBC
retval = __current_locale_name (category);
#else
retval = _nl_locale_name (category, categoryname);
#endif
return language != NULL && strcmp (retval, "C") != 0 ? language : retval;
}
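The final return statement carries the whole rule, so it is worth isolating. The sketch below takes the two inputs as plain parameters instead of reading them from the environment; demo_message_locale is a name invented for this example.

```c
#include <string.h>

/* Decide which value drives message lookup: LANGUAGE wins when it is set
   and non-empty, unless the selected locale is the C locale. */
const char *demo_message_locale(const char *language, const char *locale_name)
{
    if (language != NULL && language[0] == '\0')
        language = NULL;   /* an empty LANGUAGE counts as unset */
    return (language != NULL && strcmp(locale_name, "C") != 0)
               ? language : locale_name;
}
```

So with the locale left at C, setting LANGUAGE alone has no effect on which catalog is used.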
4. Path of gcc's localization file
tsecer@harry: file /usr/share/locale/zh_CN/LC_MESSAGES/gcc.mo
/usr/share/locale/zh_CN/LC_MESSAGES/gcc.mo: GNU message catalog (little endian), revision 0.0, 5001 messages
tsecer@harry:
Workflow
This document gives a brief description of the whole msg workflow.