Unicode Programming in VC++ (a classic, reposted for sharing)
Author: Han Yaoxu (韩耀旭)
Original link: http://www.vckbase.com/document/viewdoc/?id=1733
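I. What Is Unicode
Let us start with ASCII, an encoding standard for English characters. Each ASCII character occupies one byte, so a one-byte encoding can represent at most 256 characters (00H to FFH). English does not need that many; normally only the first 128 (00H to 7FH, high bit 0) are used, covering control characters, digits, upper- and lower-case letters, and some other symbols. The other 128 (80H to FFH, high bit 1) are known as "extended ASCII" and are generally used for box-drawing characters, some phonetic symbols, and other miscellaneous symbols.
This scheme is clearly adequate for English. For Chinese, Arabic, and other complex scripts, however, 256 characters are nowhere near enough.
So individual countries drew up their own character encoding standards. The Chinese one is called "GB2312-80". It is compatible with ASCII: taking advantage of the fact that extended ASCII was never truly standardized, it represents each Chinese character with two extended-ASCII bytes, which distinguishes it from the ASCII range.
But this approach has problems, the biggest being that the Chinese encoding overlaps with extended ASCII. Much software uses the extended-ASCII box-drawing characters to draw tables; when such software runs on a Chinese system, those table characters are misread as Chinese characters and appear as garbage.
In addition, every country and region has its own encoding rules, and they conflict with one another, which makes exchanging information between countries and regions very troublesome.
To really solve the problem, one cannot start from extended ASCII. An entirely new encoding system is needed, one that considers Chinese, French, German, and every other script together and assigns each character its own unique code.
Thus Unicode was born.
Unicode is also a character encoding. As originally designed it uses two bytes per character (0000H to FFFFH), giving 65,536 code values, which was expected to hold the characters of all the world's languages. (Unicode has since grown beyond 16 bits; Windows stores text as UTF-16, in which every character used in this article still occupies a single 16-bit unit.)
In Unicode all characters are treated alike. A Chinese character is no longer "two extended-ASCII bytes" but "one Unicode character"; in other words, every character of every script is handled as a single character with its own unique Unicode code.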
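II. Benefits of Using Unicode
Using Unicode lets your project support multiple languages at once and makes it ready for internationalization.
Moreover, Windows NT was developed using Unicode, and the whole system is built on it. If you call an API function and pass it an ANSI string (the ASCII character set together with the compatible character sets derived from it, such as GB2312, are commonly called ANSI character sets), the system must first convert the string to Unicode and then pass the Unicode string to the operating system. If the function is expected to return an ANSI string, the system first converts the Unicode string to ANSI and then returns the result to your application. These conversions cost time and memory. An application developed in Unicode therefore runs more efficiently.
The encodings of a few characters illustrate the difference between ANSI and Unicode:
Character | A | N | 和 |
ANSI code (GB2312) | 41H | 4EH | BACDH |
Unicode code | 0041H | 004EH | 548CH |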
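As a small illustration of the Unicode row in the table above, the following sketch stores the three sample characters as 16-bit units and reads their values back. It uses the standard C++11 type char16_t rather than the Windows types discussed later, and the helper names sampleText and codeUnitAt are made up for this example.

```cpp
#include <cstddef>
#include <string>

// The three sample characters as UTF-16 text; \u548c is the character 和.
inline std::u16string sampleText() {
    return u"AN\u548c";
}

// Value of the 16-bit code unit at position i (throws if i is out of range).
inline unsigned codeUnitAt(const std::u16string& s, std::size_t i) {
    return static_cast<unsigned>(s.at(i));
}
```

Each of the three characters, Latin or Chinese, takes exactly one 16-bit unit, which is the "one character, one code" property described in section I.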
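III. Unicode Programming in C++
Wide-character support is in fact part of the ANSI C standard, where it exists so that characters needing more than one byte can be represented. Wide characters and Unicode are not the same thing; Unicode is just one encoding that wide characters can carry.
1. Definition of wide characters
In ANSI, a character (char) is one byte long. With Unicode, a character occupies one 16-bit word. The VC++ header wchar.h defines the basic wide-character type wchar_t: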
typedef unsigned short wchar_t;
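This shows clearly that in VC++ a wide character is simply an unsigned short integer.
2. Wide-character string constants
C++ programmers construct string constants all the time. How is a wide-character string constant written? Simply put a capital L in front of the string literal, for example:
const wchar_t *str1=L"Hello";
The L matters: only with it does the compiler know to store the string one word per character. Note also that no space is allowed between the L and the string.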
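The effect of the L prefix can be seen by comparing the storage of the same text with and without it. This is a minimal sketch in standard C++; the names narrowHello, wideHello, narrowBytes and wideBytes are invented for the illustration, and the size of wchar_t is an assumption that depends on the compiler (2 bytes with VC++, commonly 4 elsewhere).

```cpp
#include <cstddef>
#include <cwchar>

// The same five letters, once as a narrow literal and once as a wide literal.
// Both arrays include the terminating zero character.
const char    narrowHello[] = "Hello";
const wchar_t wideHello[]   = L"Hello";

// Storage in bytes: 6 chars versus 6 wide characters.
inline std::size_t narrowBytes() { return sizeof(narrowHello); }
inline std::size_t wideBytes()   { return sizeof(wideHello); }
```

Both arrays hold six elements, but the wide one takes sizeof(wchar_t) bytes per element.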
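3. Wide-character string library functions
To operate on wide-character strings, C++ defines a dedicated set of functions. For example, the function that returns the length of a wide-character string is
size_t __cdecl wcslen(const wchar_t*);
Why define these functions separately? The basic reason is that ANSI strings mark their end with '\0' (Unicode strings end with a 16-bit zero, that is, two zero bytes), and many string functions depend on this to work correctly. With wide characters each character occupies one word in memory, so string functions written for ANSI characters no longer work correctly. Take the string "Hello": as wide characters, its five characters are: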
0x0048 0x0065 0x006c 0x006c 0x006f
In memory (on a little-endian machine such as a PC) the actual layout is:
48 00 65 00 6c 00 6c 00 6f 00
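So when an ANSI string function such as strlen reaches the 00 after the first 48, it concludes that the string has ended; using strlen to measure a wide-character string like this one always gives 1!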
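The pitfall just described can be reproduced in a few lines. The sketch below assumes a little-endian machine and uses the standard type char16_t so that each character is 16 bits wide on any compiler, matching the memory layout shown above; the function names wideLength and byteStringLength are hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Length in characters, counted up to the 16-bit zero terminator
// (the same rule wcslen applies to wchar_t strings under VC++).
inline std::size_t wideLength(const char16_t* s) {
    return std::char_traits<char16_t>::length(s);
}

// What a byte-oriented function reports for the same memory:
// strlen stops at the first zero byte it meets.
inline std::size_t byteStringLength(const char16_t* s) {
    return std::strlen(reinterpret_cast<const char*>(s));
}
```

For u"Hello" the first function sees five characters, while the second stops at the zero byte that follows 48 in memory.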
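4. Using macros to write code that works for both ANSI and Unicode
As shown, C++ has a complete set of data types and functions for Unicode programming; in other words, you can do Unicode programming entirely in C++.
Suppose we want two versions of our program: an ANSI version and a Unicode version. Writing two sets of code would certainly work, but maintaining separate code for ANSI and Unicode characters is very tedious. To ease the burden, C++ defines a series of macros that help you write code common to both.
In essence, these macros expand to ANSI or Unicode characters (strings) depending on whether "_UNICODE" (note the leading underscore) is defined.
The following is an excerpt from the tchar.h header: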
#ifdef _UNICODE
typedef wchar_t TCHAR;
#define __T(x) L##x
#else
typedef char TCHAR;
#define __T(x) x
#endif
#define _T(x) __T(x)
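As can be seen, these macros expand to ANSI or Unicode characters depending on whether "_UNICODE" is defined. The macros defined in tchar.h fall into two categories:
A. Macros for defining characters and string constants. We list only the two most commonly used:
Macro | _UNICODE not defined (ANSI characters) | _UNICODE defined (Unicode characters) |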
TCHAR | char | wchar_t |
_T(x) | x | L##x |
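Note:
"##" is preprocessor syntax from the ANSI C standard, called the "token-pasting operator"; here it attaches the preceding L to the macro argument. That is, if we write _T("Hello"), it expands to L"Hello".
B. Macros for calling string functions
C++ also defines a series of macros for the string functions. Again, we list only a few common ones:
Macro | _UNICODE not defined (ANSI characters) | _UNICODE defined (Unicode characters) |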
_tcschr | strchr | wcschr |
_tcscmp | strcmp | wcscmp |
_tcslen | strlen | wcslen |
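To show how the two groups of macros work together, here is a minimal sketch of code written once against TCHAR, _T and the _tcs functions. On Windows it includes the real tchar.h; on other platforms it falls back to a few definitions that mirror the excerpt and tables above, and that fallback is an assumption made only so the example compiles everywhere. The functions greetingLength and sameText are hypothetical.

```cpp
#include <stddef.h>
#include <string.h>
#include <wchar.h>

#ifdef _WIN32
#include <tchar.h>               // the real generic-text header
#else
// Assumed fallback for other platforms, mirroring the tchar.h excerpt.
#ifdef _UNICODE
typedef wchar_t TCHAR;
#define __T(x) L##x
#define _tcslen wcslen
#define _tcscmp wcscmp
#else
typedef char TCHAR;
#define __T(x) x
#define _tcslen strlen
#define _tcscmp strcmp
#endif
#define _T(x) __T(x)
#endif

// Written once; builds as char code or wchar_t code depending on _UNICODE.
inline size_t greetingLength() {
    const TCHAR* greeting = _T("Hello");
    return _tcslen(greeting);
}

// Generic comparison: expands to strcmp or wcscmp.
inline bool sameText(const TCHAR* a, const TCHAR* b) {
    return _tcscmp(a, b) == 0;
}
```

Neither function mentions char or wchar_t directly, so switching the build between ANSI and Unicode requires no source changes.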
IV. Unicode Programming with the Win32 API
The Win32 API defines some character data types of its own, in the winnt.h header. For example:
typedef char CHAR;
typedef unsigned short WCHAR; // wc, 16-bit UNICODE character
typedef CONST CHAR *LPCSTR, *PCSTR;
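The Win32 API defines in winnt.h some macros for characters and string constants that support common ANSI/Unicode programming. Again, only the most commonly used are listed: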
#ifdef UNICODE
typedef WCHAR TCHAR, *PTCHAR;
typedef LPWSTR LPTCH, PTCH;
typedef LPWSTR PTSTR, LPTSTR;
typedef LPCWSTR LPCTSTR;
#define __TEXT(quote) L##quote // r_winnt
#else /* UNICODE */ // r_winnt
typedef char TCHAR, *PTCHAR;
typedef LPSTR LPTCH, PTCH;
typedef LPSTR PTSTR, LPTSTR;
typedef LPCSTR LPCTSTR;
#define __TEXT(quote) quote // r_winnt
#endif /* UNICODE */ // r_winnt
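As the header excerpt shows, winnt.h performs conditional compilation according to whether UNICODE (no underscore) is defined.
The Win32 API also defines a set of string functions that expand to ANSI or Unicode string functions depending on whether "UNICODE" is defined, for example lstrlen. The API string functions and the C++ ones provide the same functionality, so where possible you are advised to use the C++ string functions; there is no need to spend much effort learning the API equivalents.
You may never have noticed, but the Win32 API actually comes in two versions: one accepts MBCS strings and the other accepts Unicode strings. For example, there is in fact no API function called SetWindowText(); instead there are SetWindowTextA() and SetWindowTextW(). The suffix A marks the MBCS function and the suffix W marks the Unicode version. These API functions are declared in winuser.h; below is the relevant part of the SetWindowText() declaration in winuser.h: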
#ifdef UNICODE
#define SetWindowText SetWindowTextW
#else
#define SetWindowText SetWindowTextA
#endif // !UNICODE
As can be seen, whether UNICODE is defined determines whether the API name refers to the Unicode version or the MBCS version.
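The same A/W naming pattern is easy to imitate, which may make the mechanism clearer. The sketch below does not call the Win32 API; TitleBytesA, TitleBytesW, TitleBytes and DEMO_TEXT are hypothetical names that stand in for a pair such as SetWindowTextA/SetWindowTextW, the generic name, and the text macro.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical "A" and "W" variants of one operation: size of a title in bytes.
inline std::size_t TitleBytesA(const char* s) {
    return std::strlen(s) * sizeof(char);
}
inline std::size_t TitleBytesW(const wchar_t* s) {
    return std::wcslen(s) * sizeof(wchar_t);
}

// The generic name is only a macro that picks one of the two real functions.
#ifdef UNICODE
#define TitleBytes TitleBytesW
#define DEMO_TEXT(quote) L##quote
#else
#define TitleBytes TitleBytesA
#define DEMO_TEXT(quote) quote
#endif
```

A call written as TitleBytes(DEMO_TEXT("abc")) compiles to the A or the W function depending solely on whether UNICODE is defined, exactly as SetWindowText does.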
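Attentive readers may already have noticed the difference between UNICODE and _UNICODE: the former has no underscore and is used by the Windows headers; the latter has a leading underscore and is used by the C run-time headers. In other words, the C run-time macros expand to Unicode or ANSI according to whether _UNICODE (with underscore) is defined, and the Windows macros do so according to whether UNICODE (no underscore) is defined.
As we will see later, in practice we do not distinguish between them strictly; we define both _UNICODE and UNICODE to build a Unicode version.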
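Because the two switches are read by different headers, one common precaution is to make sure that defining either one defines the other. The fragment below is a sketch of that idea, not something taken from the Windows headers; it would have to appear before any Windows or C run-time header is included, and the helper switchesAgree is hypothetical.

```cpp
// Keep the Windows switch (UNICODE) and the C run-time switch (_UNICODE) in step.
#if defined(UNICODE) && !defined(_UNICODE)
#define _UNICODE
#endif
#if defined(_UNICODE) && !defined(UNICODE)
#define UNICODE
#endif

// True when both switches are defined or both are undefined.
inline bool switchesAgree() {
#if defined(UNICODE) == defined(_UNICODE)
    return true;
#else
    return false;
#endif
}
```

With the two blocks at the top in place, a build can no longer mix wide Windows calls with narrow C run-time calls by accident.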
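V. Writing Unicode Applications in VC++ 6.0
VC++ 6.0 supports Unicode programming, but the default is ANSI, so developers need only change their coding habits slightly to write applications that support Unicode.
Unicode programming with VC++ 6.0 mainly involves the following tasks:
1. Add the UNICODE and _UNICODE preprocessor definitions to the project.
Steps: open the [Project]->[Settings…] dialog, shown in Figure 1. On the C/C++ tab, in "Preprocessor definitions", remove _MBCS and add _UNICODE,UNICODE (separated by a comma). The result is shown in Figure 2:
Figure 1
Figure 2
When UNICODE and _UNICODE are not defined, all functions and types default to their ANSI versions; once UNICODE and _UNICODE are defined, all MFC classes and Windows APIs switch to their wide-character versions.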
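2. Set the program entry point
MFC applications have a dedicated entry point for Unicode, so we must set the entry point; otherwise link errors occur.
To set the entry point, open the [Project]->[Settings…] dialog and, on the Link page under the Output category, enter wWinMainCRTStartup in the Entry Point field.
Figure 3
3. Use the generic ANSI/Unicode data types
Microsoft provides some generic data types and macros compatible with both ANSI and Unicode. The ones we use most often are _T, TCHAR, LPTSTR, and LPCTSTR.
Incidentally, LPCTSTR and const TCHAR* are exactly equivalent. L stands for a long pointer, a leftover from 16-bit operating systems such as Windows 3.1; in Win32 and other 32-bit operating systems, long pointers, near pointers, and the far modifier exist only for compatibility and have no practical meaning. P (pointer) indicates a pointer; C (const) indicates a constant; T (as in the _T macro) indicates compatibility with both ANSI and Unicode; STR (string) indicates that the variable is a string. Taken together, LPCTSTR means a pointer to a constant string whose character type changes with certain macro definitions. For example: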
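TCHAR* szText=_T("Hello!");
TCHAR szText[]=_T("I Love You");
LPCTSTR lpszText=_T("大家好!");
Arguments passed to functions should change accordingly, for example:
MessageBox(_T("你好"));
In an ANSI build this statement also compiles without the _T macro. In a Unicode build, however, MessageBox expects a wide string, and a plain "你好" literal is not converted automatically; it is a compile error. So always wrap string literals in the _T macro.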
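4. Adjust string operations
Some string functions need the number of characters in a buffer (sizeof(szBuffer)/sizeof(TCHAR)), while others need the number of bytes (sizeof(szBuffer)). Be aware of this and examine each string function carefully to make sure you get the correct result.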
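The two quantities can be wrapped in small helpers so that they are never confused. This is a sketch using C++11 templates, which VC++ 6.0 itself would not accept; the names charCount and byteCount are invented for the example, and with TCHAR buffers they compute the same values as the sizeof expressions above.

```cpp
#include <cstddef>

// Number of characters a fixed-size buffer can hold:
// the same value as sizeof(buf) / sizeof(buf[0]).
template <typename T, std::size_t N>
constexpr std::size_t charCount(const T (&)[N]) {
    return N;
}

// Size of the same buffer in bytes: the same value as sizeof(buf).
template <typename T, std::size_t N>
constexpr std::size_t byteCount(const T (&)[N]) {
    return N * sizeof(T);
}
```

For a char buffer the two results coincide, which is why ANSI-only code often gets away with mixing them up; for a wide buffer they differ by a factor of sizeof(wchar_t).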
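ANSI functions begin with str, such as strcpy(), strcat(), strlen();
Unicode functions begin with wcs, such as wcscpy(), wcscat(), wcslen();
Generic ANSI/Unicode functions begin with _tcs, such as _tcscpy() (C run-time library);
Generic ANSI/Unicode functions begin with lstr, such as lstrcpy() (Windows functions).
For compatibility with both ANSI and Unicode, use the generic string functions that begin with _tcs or lstr.
VI. A Unicode Programming Example
Step 1:
Open VC++ 6.0 and create a new dialog-based project named Unicode. Add a button control to the main dialog IDD_UNICODE_DIALOG, double-click it, and add its handler function: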
void CUnicodeDlg::OnButton1()
{
TCHAR* str1=_T("ANSI和UNICODE编码试验");
m_disp=str1;
UpdateData(FALSE);
}
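Add a static text control IDC_DISP and use ClassWizard to attach a CString member variable m_disp to it. Build the project in the default ANSI configuration to produce Unicode.exe.
Step 2:
Open Control Panel, click "Date, Time, Language, and Regional Options", and in that window click "Regional and Language Options" to open the dialog of the same name. On the "Advanced" tab, change "Language for non-Unicode programs" to "Japanese" and click "Apply", as in Figure 4:
Figure 4
Click "Yes" in the dialog that appears and restart the computer for the setting to take effect.
Run Unicode.exe and click "Button1": the static text control shows garbled characters.
Step 3:
Rebuild the project in the Unicode configuration to produce Unicode.exe. Run Unicode.exe again and click "Button1". Now you can see the advantage of Unicode.
That is all for now. Good luck.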
===================================================================
Unicode, MBCS and Generic text mappings
Introduction
In order to allow your programs to be used in international markets it is worth making your application Unicode or MBCS aware. The Unicode character set is a "wide character" (2 bytes per character) set that contains every character available in every language, including all technical symbols and special publishing characters. Multibyte character set (MBCS) uses either 1 or 2 bytes per character and is used for character sets that contain large numbers of different characters (e.g. Asian language character sets).
Which character set you use depends on the language and the operating system. Unicode requires more space than MBCS since each character is 2 bytes. It is also faster than MBCS and is used by Windows NT as standard, so non-Unicode strings passed to and from the operating system must be translated, incurring overhead. However, Unicode is not supported on Win95 and so MBCS may be a better choice in this situation. Note that if you wish to develop applications in the Windows CE environment then all applications must be compiled in Unicode.
Using MBCS or Unicode
The best way to use Unicode or MBCS - or indeed even ASCII - in your programs is to use the generic text mapping macros provided by Visual C++. That way you can simply use a single define to swap between Unicode, MBCS and ASCII without having to do any recoding.
To use MBCS or Unicode you need only define either _MBCS or _UNICODE in your project. For Unicode you will also need to specify the entry point symbol in your Project settings as wWinMainCRTStartup. Please note that if both _MBCS and _UNICODE are defined then the result will be unpredictable.
Generic Text mappings and portable functions
The generic text mappings replace the standard char or LPSTR types with generic TCHAR or LPTSTR macros. These macros will map to different types and functions depending on whether you have compiled with Unicode or MBCS (or neither) defined. The simplest way to use the TCHAR type is to use the CString class - it is extremely flexible and does most of the work for you.
In conjunction with the generic character type, there is a set of generic string manipulation functions prefixed by _tcs. For instance, instead of using the strrev function in your code, you should use the _tcsrev function which will map to the correct function depending on which character set you have compiled for. The table below demonstrates:
#define | Compiled Version | Example |
_UNICODE | Unicode (wide-character) | _tcsrev maps to _wcsrev |
_MBCS | Multibyte-character | _tcsrev maps to _mbsrev |
None (the default: neither _UNICODE nor _MBCS defined) | SBCS (ASCII) | _tcsrev maps to strrev |
Each str* function has a corresponding _tcs* function that should be used instead. See the TCHAR.H file for all the mapping and macros that are available. Just look up the online help for the string function in question in order to find the equivalent portable function.
Note: Do not use the str* family of functions with Unicode strings, since Unicode strings are likely to contain embedded null bytes.
The next important point is that each literal string should be enclosed by the TEXT() (or _T()) macro. This macro prepends an "L" in front of literal strings if the project is being compiled in Unicode, or does nothing if MBCS or ASCII is being used. For instance, the string _T("Hello") will be interpreted as "Hello" in MBCS or ASCII, and L"Hello" in Unicode. If you are working in Unicode and do not use the _T() macro, you may get compiler warnings.
Note that you can use ASCII and Unicode within the same program, but not within the same string.
All MFC functions except for database class member functions are Unicode aware. This is because many database drivers themselves do not handle Unicode, and so there was no point in writing Unicode aware MFC classes to wrap these drivers.
Converting between Generic types and ASCII
ATL provides a bunch of very useful macros for converting between different character formats. The basic form of these macros is X2Y(), where X is the source format and Y is the destination format. Possible conversion formats are shown in the following table.
String Type | Abbreviation |
---|---|
ASCII (LPSTR) | A |
WIDE (LPWSTR) | W |
OLE (LPOLESTR) | OLE |
Generic (LPTSTR) | T |
Const | C |
Thus, A2W converts an LPSTR to an LPWSTR, OLE2T converts an LPOLESTR to an LPTSTR, and so on.
There are also const forms (denoted by a C) that convert to a const string. For instance, A2CT converts from LPSTR to LPCTSTR.
When using the string conversion macros you need to include the USES_CONVERSION macro at the beginning of your function:
void foo(LPSTR lpsz)
{
    USES_CONVERSION;
    ...
    LPTSTR szGeneric = A2T(lpsz);
    // Do something with szGeneric
    ...
}
Two caveats on using the conversion macros:
- Never use the conversion macros inside a tight loop. This will cause a lot of memory to be allocated each time the conversion is performed, and will result in slow code. Better to perform the conversion outside the loop and pass the converted value into the loop.
- Never return the result of the macros directly from a function, unless the return value implies making a copy of the data before returning. The converted string lives in stack memory that is released when the function returns. For instance, if you have a function that returns an LPTSTR, then do not do the following:
LPTSTR BadReturn(LPSTR lpsz)
{
    USES_CONVERSION;
    // do something
    return A2T(lpsz);
}
Instead, you should return the value as a CString, which would imply a copy of the string would be made before the function returns:
CString GoodReturn(LPSTR lpsz)
{
    USES_CONVERSION;
    // do something
    return A2T(lpsz);
}
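The advice to return a copy can also be illustrated without ATL. The sketch below is not the A2T macro; it swaps in the standard C library function mbstowcs and returns a std::wstring by value, so the caller owns a copy. The function name widen is hypothetical, and the conversion follows the current C locale, so only plain ASCII input is assumed here.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Converts a narrow string to a wide string and returns it by value,
// so no pointer into a temporary buffer ever leaves the function.
inline std::wstring widen(const std::string& narrow) {
    // One wide character per input byte is always enough room.
    std::wstring wide(narrow.size(), L'\0');
    std::size_t converted = std::mbstowcs(&wide[0], narrow.c_str(), wide.size());
    if (converted == static_cast<std::size_t>(-1)) {
        return std::wstring();   // invalid multibyte sequence in the input
    }
    wide.resize(converted);      // drop the unused tail
    return wide;
}
```

As with GoodReturn, the returned object carries its own storage, so it remains valid after the function's stack frame is gone.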
Tips and Traps
The TRACE statement
The TRACE macros have a few cousins - namely the TRACE0, TRACE1, TRACE2 and TRACE3 macros. These macros allow you to specify a format string (as in the normal TRACE macro), and either 0, 1, 2 or 3 parameters, without the need to enclose your literal format string in the _T() macro. For instance,
TRACE(_T("This is trace statement number %d\n"), 1);
can be written
TRACE1("This is trace statement number %d\n", 1);
Viewing Unicode strings in the debugger
If you are using Unicode in your application and wish to view Unicode strings in the debugger, then you will need to go to Tools | Options | Debug and click on "Display Unicode Strings".
The Length of strings
Be careful when performing operations that depend on the size or length of a string. For instance, CString::GetLength returns the number of characters in a string, NOT the size in bytes. If you were to write the string to a CArchive object, then you would need to multiply the length of the string by the size of each character in the string to get the number of bytes to write:
CString str = _T("Hello, World");
archive.Write( str, str.GetLength( ) * sizeof( TCHAR ) );
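The same length-versus-bytes arithmetic can be tried outside MFC. In this sketch std::wstring stands in for CString and a std::ostringstream stands in for the archive; both substitutions, and the names byteSize and toRawBytes, are assumptions made for the illustration.

```cpp
#include <cstddef>
#include <ios>
#include <sstream>
#include <string>

// Number of bytes occupied by the characters of a wide string
// (size() counts characters, so it must be scaled by the character size).
inline std::size_t byteSize(const std::wstring& s) {
    return s.size() * sizeof(wchar_t);
}

// Writes the raw character data to a byte stream, as archive.Write would.
inline std::string toRawBytes(const std::wstring& s) {
    std::ostringstream out(std::ios::binary);
    out.write(reinterpret_cast<const char*>(s.data()),
              static_cast<std::streamsize>(byteSize(s)));
    return out.str();
}
```

Passing s.size() alone as the byte count would write only a fraction of the string whenever the character type is wider than one byte.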
Reading and Writing ASCII text files
If you are using Unicode or MBCS then you need to be careful when writing ASCII files. The safest and easiest way to write text files is to use the CStdioFile class provided with MFC. Just use the CString class and the ReadString and WriteString member functions and nothing should go wrong. However, if you need to use the CFile class and its associated Read and Write functions, then if you use the following code:
CFile file(...);
CString str = _T("This is some text");
file.Write( str, (str.GetLength()+1) * sizeof( TCHAR ) );
instead of
CStdioFile file(...);
CString str = _T("This is some text");
file.WriteString(str);
then the results will be significantly different. The two lines of text below are from a file created using the first and second code snippets respectively:
(This text was viewed using WordPad)
Not all structures use the generic text mappings
For instance, the CHARFORMAT structure, if the RichEditControl version is less than 2.0, uses a char[] for the szFaceName field, instead of a TCHAR as would be expected. You must be careful not to blindly change "..." to _T("...") without first checking. In this case, you would probably need to convert from TCHAR to char before copying any data to the szFaceName field.
Copying text to the Clipboard
This is one area where you may need to use ASCII and Unicode in the same program, since the CF_TEXT format for the clipboard uses ASCII only. NT systems have the option of the CF_UNICODETEXT format if you wish to use Unicode on the clipboard.
Installing the Unicode MFC libraries
The Unicode versions of the MFC libraries are not copied to your hard drive unless you select them during a Custom installation. They are not copied during other types of installation. If you attempt to build or run an MFC Unicode application without the MFC Unicode files, you may get errors.
(From the online docs) To copy the files to your hard drive, rerun Setup, choose Custom installation, clear all other components except "Microsoft Foundation Class Libraries," click the Details button, and select both "Static Library for Unicode" and "Shared Library for Unicode."