setlocale and Encoding Conversion

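I recently investigated a problem in our project where printf-family calls failed to format a string correctly. This post summarizes what I found.

I will use sprintf_s to illustrate the symptom.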

int sprintf_s(
   char *buffer,
   size_t sizeOfBuffer,
   const char *format [,
      argument] ... 
);


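Conditions that trigger the problem

setlocale has been called.
The format argument is UTF-8 encoded.
A % in format directly follows non-ASCII characters, and when the bytes are interpreted in the locale's encoding, that % is combined with the preceding byte and treated as part of a new character.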

In our case the locale was ".ACP", that is, the ANSI code page (CP932), and the format argument was "日付%04d/%02d/%02d 時刻%02d:%02d:%02d.%03d". Its UTF-8 encoding is:

00000003h: E6 97 A5 E4 BB 98 25 30 34 64 2F 25 30 32 64 2F ; 譌･莉・04d/%02d/
00000013h: 25 30 32 64 20 E6 99 82 E5 88 BB 25 30 32 64 3A ; %02d 譎ょ綾%02d:
00000023h: 25 30 32 64 3A 25 30 32 64 2E 25 30 33 64 ; %02d:%02d.%03d

The UTF-8 encoding of 日付 is

E6 97 A5 E4 BB 98

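Under CP932, E6 97 is a standalone SJIS double-byte character 【譌】, A5 is a single-byte character 【･】, and E4 BB is a double-byte character 【莉】.

98 falls in the SJIS lead-byte range (0x81 to 0x9F or 0xE0 to 0xEF), so 98 and the following 25 are consumed together as one double-byte character. As far as printf is concerned, the first % (the one in %04d) no longer exists, so it is never substituted.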

Now for setlocale itself.

Affected functions

According to https://msdn.microsoft.com/en-US/library/x99tb11d(v=vs.80).aspx, setlocale has different effects depending on the category that is set.

LC_COLLATE

The strcoll, _stricoll, wcscoll, _wcsicoll, strxfrm, _strncoll, _strnicoll, _wcsncoll, _wcsnicoll, and wcsxfrm functions.

LC_TIME

The strftime and wcsftime functions.

LC_MONETARY

Monetary-formatting information returned by the localeconv function.

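// In short, this category is used when outputting localized monetary information (such as the currency symbol), and it has to be queried explicitly.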

LC_NUMERIC

Decimal-point character for the formatted output routines (such as printf), for the data-conversion routines, and for the non-monetary formatting information returned by localeconv. In addition to the decimal-point character, LC_NUMERIC also sets the thousands separator and the grouping control string returned by localeconv.

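// Similar to LC_MONETARY, but note that the decimal-point character does not have to be queried explicitly: printf and similar functions implicitly use the locale's decimal point when they output floating-point numbers.

For example, the default locale is C, and the decimal point in the C locale is ".".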

LC_CTYPE

The character-handling functions (except isdigit, isxdigit, mbstowcs, and mbtowc, which are unaffected).

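I have to complain that MSDN's description of LC_CTYPE is both incomplete and incorrect: mbstowcs and mbtowc are in fact affected by LC_CTYPE.

The explanation at http://www.cplusplus.com/reference/clocale/setlocale/?kw=setlocale appears to be correct: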

Affects character handling functions (all functions of <cctype>, except isdigit and isxdigit), and the multibyte and wide character functions.

Which functions exactly? From <cctype>, the character handling functions are:

isalnum: Check if character is alphanumeric
isalpha: Check if character is alphabetic
isblank: Check if character is blank
iscntrl: Check if character is a control character
isgraph: Check if character has graphical representation
islower: Check if character is lowercase letter
isprint: Check if character is printable
ispunct: Check if character is a punctuation character
isspace: Check if character is a white-space
isupper: Check if character is uppercase letter
tolower: Convert uppercase letter to lowercase
toupper: Convert lowercase letter to uppercase

In addition, from http://gsp.com/cgi-bin/man.cgi?section=3&topic=multibyte and https://msdn.microsoft.com/en-US/library/6y9se58z(v=vs.80).aspx, the multibyte and wide character functions are:

Function    Section   Description
mblen       3         get number of bytes in a character
mbrlen      3         get number of bytes in a character (restartable)
mbrtowc     3         convert a character to a wide-character code (restartable)
mbsrtowcs   3         convert a character string to a wide-character string (restartable)
mbstowcs    3         convert a character string to a wide-character string
mbtowc      3         convert a character to a wide-character code
wcrtomb     3         convert a wide-character code to a character (restartable)
wcstombs    3         convert a wide-character string to a character string
wcsrtombs   3         convert a wide-character string to a character string (restartable)
wctomb      3         convert a wide-character code to a character

posted @ 2015-07-17 18:26 萧何9527
