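Notice: Unicode is only a coded character set that assigns an integer to each character. There are several ways to represent a string of characters as a string of bytes. The two most obvious store Unicode text as a sequence of 2-byte or 4-byte units; their formal names are UCS-2 and UCS-4. Using UCS-2 (or UCS-4) under Unix causes serious problems. Strings in these encodings can contain bytes such as '\0' or '/', which have special meaning in file names and in the arguments of other C library functions. In addition, most UNIX tools that expect ASCII files cannot read 16-bit characters without major modification. For these reasons, UCS-2 is not suitable as an external encoding of Unicode in file names, text files, environment variables, and so on. The UTF encodings are also ways of representing Unicode characters as a sequence of bytes; UTF-8 in particular does not have these drawbacks of UCS-2 and UCS-4.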
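The problem with '\0' and '/' can be seen directly by looking at the encoded bytes. The following is a minimal Python sketch; the helper name `contains_unix_special_bytes` and the sample string are made up for illustration, and `utf-16-be` is used as a stand-in for UCS-2, which is byte-identical for BMP characters.

```python
def contains_unix_special_bytes(data: bytes) -> bool:
    """Return True if the byte string holds a NUL or a '/' byte."""
    return b"\x00" in data or b"/" in data


# "A" is U+0041; "\u012f" is U+012F (LATIN SMALL LETTER I WITH OGONEK).
text = "A\u012f"

# 2 bytes per character: 00 41 01 2F. The 00 is a NUL and 2F is '/'.
ucs2_bytes = text.encode("utf-16-be")

# UTF-8: 41 C4 AF. Bytes of a non-ASCII character are all >= 0x80.
utf8_bytes = text.encode("utf-8")

print(ucs2_bytes.hex(), contains_unix_special_bytes(ucs2_bytes))
print(utf8_bytes.hex(), contains_unix_special_bytes(utf8_bytes))
```

Neither character in the sample is a NUL or a slash, yet the 2-byte form contains both bytes, which is what breaks C string functions and path handling.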

Unicode:

Unicode is a computing industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems. Developed in tandem with the Universal Character Set standard and published in book form as The Unicode Standard, Unicode consists of a repertoire of more than 100,000 characters, a set of code charts for visual reference, an encoding methodology and set of standard character encodings, an enumeration of character properties such as upper and lower case, a set of reference data computer files, and a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts).

Unicode defines a codespace of 1,114,112 code points in the range 0x0 to 0x10FFFF, most of which are available for encoding of characters. The majority of the common characters used in the major languages of the world are encoded in the first 65,536 code points, also known as the Basic Multilingual Plane (BMP). It is normal to reference a Unicode code point by writing "U+" followed by its hexadecimal number. For code points in the BMP, four digits are used (e.g. U+0058 for the character LATIN CAPITAL LETTER X); for code points outside the BMP, five or six digits are used, as required (e.g. U+E0001 for the character LANGUAGE TAG and U+10FFFD for the character PRIVATE USE CHARACTER-10FFFD).
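The "U+" notation and the BMP boundary can be sketched in a few lines of Python. The names `format_code_point`, `is_in_bmp`, `BMP_LAST` and `UNICODE_LAST` are illustrative, not part of any standard API.

```python
BMP_LAST = 0xFFFF          # last code point of the Basic Multilingual Plane
UNICODE_LAST = 0x10FFFF    # last code point of the Unicode codespace


def is_in_bmp(code_point: int) -> bool:
    """True for code points in the first 65,536 positions."""
    return 0 <= code_point <= BMP_LAST


def format_code_point(code_point: int) -> str:
    """Write a code point as 'U+' plus at least four hex digits."""
    if not 0 <= code_point <= UNICODE_LAST:
        raise ValueError("outside the Unicode codespace")
    return f"U+{code_point:04X}"


for cp in (0x58, 0xE0001, 0x10FFFD):
    print(format_code_point(cp), "BMP" if is_in_bmp(cp) else "outside BMP")
```

The `04X` format pads to four digits for BMP code points and lets larger values grow to five or six digits as needed.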

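Note: Unicode does not always use 2 bytes per character. The BMP holds only the most commonly used characters of the major languages, and these fit in 2 bytes; characters outside the BMP do not. Covering the full range up to 0x10FFFF takes 21 bits, a little under 3 bytes. UCS-4 was designed as a larger, not yet fully populated 31-bit character set; together with a leading bit that is always 0 it occupies 32 bits, i.e. 4 bytes, and in theory it could represent up to 2^31 characters, enough for the symbols of every language. Unicode was once expected to grow to cover all of UCS-4 (ISO 10646-1 implementation level 3), but ISO 10646 has since been restricted to the same 0x10FFFF limit as Unicode. In practice, BMP characters are still the ones used most.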
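A short Python sketch makes the size difference concrete. The helper `encoded_sizes` and the two sample characters are chosen only for illustration; the big-endian codecs are used so that no byte order mark is added to the output.

```python
def encoded_sizes(ch: str) -> dict:
    """Byte length of one character in 16-bit and 32-bit encodings."""
    return {
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


bmp_char = "\u4e2d"                 # U+4E2D, inside the BMP
supplementary_char = "\U0001D11E"   # U+1D11E, outside the BMP

print(encoded_sizes(bmp_char))             # 2 bytes in UTF-16
print(encoded_sizes(supplementary_char))   # 4 bytes: a surrogate pair
print((0x10FFFF).bit_length())             # bits needed for the largest code point
```

The character outside the BMP needs two 16-bit units in UTF-16, which is exactly the case a fixed 2-byte representation cannot handle.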

Note: Unicode file header signatures (byte order marks):

  • EF BB BF UTF-8
  • FF FE UTF-16 (or UCS-2), little endian
  • FE FF UTF-16 (or UCS-2), big endian
  • FF FE 00 00 UTF-32 (or UCS-4), little endian
  • 00 00 FE FF UTF-32 (or UCS-4), big endian
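Checking for these signatures at the start of a file could look like the sketch below. It relies on the BOM constants in Python's standard `codecs` module rather than hand-typed bytes; the function name `detect_bom` and the returned labels are assumptions made for this example.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32 little-endian mark
# begins with the same two bytes as the UTF-16 little-endian mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return an encoding label if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(codecs.BOM_UTF8 + b"hello"))
print(detect_bom(b"plain ascii"))
```

This is only a heuristic: many files carry no signature at all, and a UTF-16 little-endian file whose first character is U+0000 would be mistaken for UTF-32.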

CJK:

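CJK Unified Ideographs aims to give a single code in ISO 10646 and Unicode to ideographs from Chinese, Japanese, Korean and Vietnamese that share the same origin and meaning and have the same or only slightly different shapes (mainly Han characters, but also Han-style characters such as Japanese kokuji, Korea-specific hanja, and Vietnamese chữ Nôm). The national standard GB13000.1, "CJK Unified Ideographs Coded Character Set", is fully equivalent to the international standard ISO 10646.1, "Universal Multiple-Octet Coded Character Set (UCS)". The most important and most commonly used part of GB13000.1 is its two-byte form, the Basic Multilingual Plane. This space of 65,536 code positions defines the scripts and symbols of almost every country or region. Within it, the contiguous region from 0x4E00 to 0x9FA5 contains 20,902 Han characters from China (including Taiwan), Japan and Korea, known as CJK (Chinese Japanese Korean) ideographs. CJK is a superset of character sets such as GB2312-80 and BIG5.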

CJK is part of Unicode; its range is 0x4E00 to 0x9FA5 inclusive (20,902 characters in total).
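The range check and the character count can be verified with a small Python sketch. The constant and function names are illustrative, and the range is the one quoted in this section, not the complete set of ideograph blocks in current Unicode versions.

```python
CJK_FIRST = 0x4E00
CJK_LAST = 0x9FA5   # end of the range as quoted in this section


def is_cjk_unified(ch: str) -> bool:
    """True if the single character ch lies in the quoted CJK range."""
    return CJK_FIRST <= ord(ch) <= CJK_LAST


# Both endpoints are included, hence the + 1.
cjk_count = CJK_LAST - CJK_FIRST + 1

print(cjk_count)
print(is_cjk_unified("\u4e2d"), is_cjk_unified("A"))
```

Counting both endpoints is what gives 20,902 rather than 20,901.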

Code Page:

The text mode of standard (VGA-compatible) PC graphics hardware is built around using an 8-bit code page, though it is possible to use two at once with some color depth sacrifice, and up to 8 may be stored in the display adaptor for easy switching. There was a selection of code pages that could be loaded into such hardware. However, it is now commonplace for operating system vendors to provide their own character encoding and rendering systems that run in a graphics mode and bypass this system entirely. The character encodings used by these graphical systems (particularly MS-Windows) are sometimes called code pages as well.

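The Windows kernel uses Unicode internally so that it can support the writing systems of the whole world. Windows uses code pages to adapt to a particular country or region. Different languages often have their own national encoding schemes: GBK for Simplified Chinese and Big5 for Traditional Chinese, for example. These schemes represent characters differently from Unicode. GBK is dedicated to Chinese text and covers far fewer characters overall than the Unicode BMP. The same byte values stand for different characters under different encoding schemes, and Windows uses the code page to decide which character a given code represents.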
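The effect of the code page on interpretation can be shown by decoding one byte pair several ways. This is a minimal Python sketch using Python's own codec names (`gbk`, `big5`, `cp1252`) in place of Windows code page numbers; the helper `decode_with` and the sample bytes are made up for the example.

```python
RAW = b"\xd6\xd0"   # two bytes with no encoding attached


def decode_with(code_page: str) -> str:
    """Interpret RAW under the given codec, replacing undecodable bytes."""
    return RAW.decode(code_page, errors="replace")


as_gbk = decode_with("gbk")         # Simplified Chinese
as_big5 = decode_with("big5")       # Traditional Chinese
as_latin = decode_with("cp1252")    # Western European

print(as_gbk, as_big5, as_latin)
```

Under GBK the pair is the single character U+4E2D, under Windows-1252 it is two accented Latin letters, and under Big5 it is something else again; the bytes alone do not say which reading is intended.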

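Character encodings in Windows fall into two groups: Unicode and MBCS (multi-byte character sets). Unicode code points work as an internal representation but are not in themselves a format for storage or transmission, which is why UTF-8, UTF-16 and UTF-32 exist. MBCS covers many encodings, such as GBK and Big5; the byte-oriented UTF-8 can arguably be counted among them as well, since Windows exposes it as code page 65001.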
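To compare how the same text looks in the Unicode transformation formats and in two multi-byte code pages, one can encode it with each codec in turn. The list `ENCODINGS` and the helper `encode_all` below are illustrative names; the little-endian variants are chosen so that no byte order mark is prepended.

```python
ENCODINGS = ["utf-8", "utf-16-le", "utf-32-le", "gbk", "big5"]


def encode_all(text: str) -> dict:
    """Encode text with every codec in ENCODINGS."""
    return {name: text.encode(name) for name in ENCODINGS}


sample = "\u4e2d\u6587"   # two Han characters, U+4E2D and U+6587
encoded = encode_all(sample)

for name, data in encoded.items():
    # Decoding with the same codec must give back the original text.
    assert data.decode(name) == sample
    print(f"{name:10} {len(data)} bytes  {data.hex()}")
```

Every codec round-trips the text, but the byte sequences and lengths differ, so the encoding has to be known before the bytes can be interpreted.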

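In C++, wchar_t is the wide-character type. Its size depends on the platform: on Windows one wchar_t occupies two bytes (a UTF-16 code unit), while on most Unix-like systems it occupies four bytes.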
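The size of the platform's wchar_t can be checked without writing any C++ by asking Python's standard `ctypes` module, whose `c_wchar` type mirrors the C compiler's wchar_t. The function name `wchar_size` is made up for this sketch.

```python
import ctypes


def wchar_size() -> int:
    """Size in bytes of the platform's C wchar_t type."""
    return ctypes.sizeof(ctypes.c_wchar)


print(wchar_size())
```

The printed value depends on where the script runs, so code that assumes a fixed two bytes per wchar_t is not portable.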