Encoding Types

Unicode

Unicode is a computing industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems. Developed in tandem with the Universal Character Set standard and published in book form as The Unicode Standard, the latest version of Unicode consists of

a repertoire of more than 107,000 characters covering 90 scripts,

a set of code charts for visual reference,

an encoding methodology and set of standard character encodings,

an enumeration of character properties such as upper and lower case,

a set of reference data computer files,

a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering, and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts).

Character encoding

In computer science, the terms character encoding, character set, and sometimes character map or code page were historically synonymous, as the same standard would specify a repertoire of characters and how they were to be encoded into a stream of code units — usually with a single character per code unit. The terms now have related but distinct meanings, reflecting the efforts of standards bodies to use precise terminology when writing about and unifying many different encoding systems.^[1] Regardless, the terms are still used interchangeably, with character set being nearly ubiquitous.

Unicode encoding model

Unicode and its parallel standard, the ISO/IEC 10646 Universal Character Set together constitute a modern, unified character encoding. Rather than mapping characters directly to octets (bytes), they separately define what characters are available, their numbering, how those numbers are encoded as a series of "code units" (limited-size numbers), and finally how those units are encoded as a stream of octets. The idea behind this decomposition is to establish a universal set of characters that can be encoded in a variety of ways.^[1] To correctly describe this model one needs more precise terms than "character set" and "character encoding". The terms used in the modern model follow:^[1]

A character repertoire is the full set of abstract characters that a system supports. The repertoire may be closed, i.e. no additions are allowed without creating a new standard (as is the case with ASCII and most of the ISO-8859 series), or it may be open, allowing additions (as is the case with Unicode and, to a limited extent, the Windows code pages). The characters in a given repertoire reflect decisions that have been made about how to divide writing systems into linear information units. The basic variants of the Latin, Greek, and Cyrillic alphabets can be broken down into letters, digits, punctuation, and a few special characters like the space, which can all be arranged in simple linear sequences that are displayed in the same order they are read. Even with these alphabets, however, diacritics pose a complication: they can be regarded either as part of a single character containing a letter and diacritic (known in modern terminology as a precomposed character), or as separate characters. The former allows a far simpler text handling system, but the latter allows any letter/diacritic combination to be used in text. Other writing systems, such as Arabic and Hebrew, are represented with more complex character repertoires due to the need to accommodate things like bidirectional text and glyphs that are joined together in different ways for different situations.
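The precomposed-versus-separate choice for diacritics can be seen directly in Unicode, which supports both. The following is a minimal sketch using Python's standard `unicodedata` module; the variable names are illustrative only.

```python
import unicodedata

# "é" as one precomposed character: LATIN SMALL LETTER E WITH ACUTE
precomposed = "\u00e9"
# "é" as a base letter followed by COMBINING ACUTE ACCENT
decomposed = "e\u0301"

# The two strings usually render alike but are different code point sequences.
same_raw = precomposed == decomposed

# NFC composes letter + diacritic where a precomposed form exists;
# NFD splits precomposed characters back into letter + diacritic.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

print(same_raw, len(precomposed), len(decomposed))
print(nfc == precomposed, nfd == decomposed)
```

Comparing text only after normalizing both sides to the same form avoids treating the two spellings as different strings.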

A coded character set specifies how to represent a repertoire of characters using a number of non-negative integer codes called code points. For example, in a given repertoire, a character representing the capital letter "A" in the Latin alphabet might be assigned to the integer 65, the character for "B" to 66, and so on. A complete set of characters and corresponding integers is a coded character set. Multiple coded character sets may share the same repertoire; for example ISO/IEC 8859-1 and IBM code pages 037 and 500 all cover the same repertoire but map them to different codes. In a coded character set, each code point only represents one character, i.e., a coded character set is a function.
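As a rough illustration of one repertoire mapped to different codes, the sketch below looks up the same letter in ISO/IEC 8859-1 and in the two IBM code pages mentioned above. It assumes the Python codec names `latin-1`, `cp037`, and `cp500` correspond to those standards; `code_in` is a hypothetical helper name.

```python
def code_in(char, codec):
    """Return the single-byte code that `codec` assigns to `char`."""
    encoded = char.encode(codec)
    if len(encoded) != 1:
        raise ValueError(f"{codec} does not use one byte for {char!r}")
    return encoded[0]

# In Unicode (and ASCII) the capital letter "A" is code point 65.
unicode_code_point = ord("A")

# The same abstract character under three coded character sets.
codes_for_a = {codec: code_in("A", codec) for codec in ("latin-1", "cp037", "cp500")}

# Code pages 037 and 500 agree on letters but differ on some punctuation.
bracket_037 = code_in("[", "cp037")
bracket_500 = code_in("[", "cp500")

print(unicode_code_point, codes_for_a, hex(bracket_037), hex(bracket_500))
```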

A character encoding form (CEF) specifies the conversion of a coded character set's integer codes into a set of limited-size integer code values that facilitate storage in a system that represents numbers in binary form using a fixed number of bits (i.e. practically any computer system). For example, a system that stores numeric information in 16-bit units would only be able to directly represent integers from 0 to 65,535 in each unit, but larger integers could be represented if more than one 16-bit unit could be used. This is what a CEF accommodates: it defines a way of mapping a single code point from a range of, say, 0 to 1.4 million, to a series of one or more code values from a range of, say, 0 to 65,535.

The simplest CEF system is simply to choose large enough units that the values from the coded character set can be encoded directly (one code point to one code value). This works well for coded character sets that fit in 8 bits (as most legacy non-CJK encodings do) and reasonably well for coded character sets that fit in 16 bits (such as early versions of Unicode). However, as the size of the coded character set increases (e.g. modern Unicode requires at least 21 bits/character), this becomes less and less efficient, and it is difficult to adapt existing systems to use larger code values. Therefore, most systems working with later versions of Unicode use either UTF-8, which maps Unicode code points to variable-length sequences of octets, or UTF-16, which maps Unicode code points to variable-length sequences of 16-bit words. (UCS-2, the fixed-width predecessor of UTF-16, uses exactly one 16-bit word per code point and therefore covers only the BMP.)
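To make the trade-off concrete, the sketch below counts how many code units one code point needs under three Unicode encoding forms. `code_unit_counts` is an illustrative helper, not part of any standard API; it relies only on Python's built-in codecs.

```python
def code_unit_counts(text):
    """Count the code units needed for `text` under three encoding forms."""
    return {
        "code points": len(text),
        "utf-8 (8-bit units)": len(text.encode("utf-8")),
        "utf-16 (16-bit units)": len(text.encode("utf-16-le")) // 2,
        "utf-32 (32-bit units)": len(text.encode("utf-32-le")) // 4,
    }

for sample in ("A", "\u20ac", "\U00024B62"):
    print(f"U+{ord(sample):04X}", code_unit_counts(sample))
```

A code point in the BMP takes one 16-bit unit, while one outside the BMP takes two; UTF-32 always takes one unit but spends 32 bits on it.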

Next, a character encoding scheme (CES) specifies how the fixed-size integer code values should be mapped into an octet sequence suitable for saving on an octet-based file system or transmitting over an octet-based network. With Unicode, a simple character encoding scheme is used in most cases, simply specifying whether the bytes for each integer should be in big-endian or little-endian order (even this isn't needed with UTF-8). However, there are also compound character encoding schemes, which use escape sequences to switch between several simple schemes (such as ISO/IEC 2022), and compressing schemes, which try to minimise the number of bytes used per code unit (such as SCSU, BOCU, and Punycode).
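The byte-order choice described above can be observed with Python's built-in codecs. This is only a sketch; the codec names `utf-16-be`, `utf-16-le`, and `utf-16` are Python's, and the variable names are illustrative.

```python
euro = "\u20ac"  # one 16-bit code unit, 0x20AC

# Same code unit, two serializations.
big_endian = euro.encode("utf-16-be")      # most significant byte first
little_endian = euro.encode("utf-16-le")   # least significant byte first

# Python's plain "utf-16" codec prepends a byte order mark (BOM)
# so that a reader can tell which order follows.
with_bom = euro.encode("utf-16")

# UTF-8 code units are single octets, so no byte order is involved.
utf8 = euro.encode("utf-8")

print(big_endian.hex(), little_endian.hex(), with_bom[:2].hex(), utf8.hex())
```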

Finally, there may be a higher level protocol which supplies additional information that can be used to select the particular variant of a Unicode character, particularly where there are regional variants that have been 'unified' in Unicode as the same character. An example is the XML attribute xml:lang.

The Unicode model reserves the term character map for historical systems which directly assign a sequence of characters to a sequence of bytes.^[1] Such systems include entities which IBM's Character Data Representation Architecture (CDRA) designates with coded character set identifiers (CCSIDs) and each of which is variously called a charset, character set, code page, or CHARMAP.^[1] The term charset is also used for similar mappings by MIME and systems based on it.^[1]

Popular character encodings

ISO 8859:

ISO 8859-1 Western Europe
ISO 8859-2 Central and Eastern Europe

 Chinese Guobiao

GB 2312
GBK (Microsoft Code page 936)
GB 18030

 Taiwan Big5 (a more famous variant is Microsoft Code page 950)

 Hong Kong HKSCS

 Korean

Universal Character Set

The Universal Character Set (UCS), defined by the International Standard ISO/IEC 10646, Information technology — Universal multiple-octet coded character set (UCS) (plus amendments to that standard), is a standard set of characters upon which many character encodings are based.

Mapping of Unicode character planes

Unicode characters can be categorized in many different ways. Unicode code points can be logically divided into 17 planes, each with 65,536 (= 2¹⁶) code points, although currently only a few planes are used:

Plane 0 (0000–FFFF): Basic Multilingual Plane (BMP). This is the plane containing most of the character assignments so far. A primary objective for the BMP is to support the unification of prior character sets as well as characters for writing systems in current use.
Plane 1 (10000–1FFFF): Supplementary Multilingual Plane (SMP).
Plane 2 (20000–2FFFF): Supplementary Ideographic Plane (SIP).
Planes 3 to 13 (30000–DFFFF): unassigned.
Plane 14 (E0000–EFFFF): Supplementary Special-purpose Plane (SSP).
Plane 15 (F0000–FFFFF): reserved for the Private Use Area (PUA).
Plane 16 (100000–10FFFF): reserved for the Private Use Area (PUA).
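Because every plane holds exactly 2¹⁶ code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch follows; `plane_of` and `PLANE_NAMES` are illustrative names, and the table only covers the planes named in the list above.

```python
PLANE_NAMES = {0: "BMP", 1: "SMP", 2: "SIP", 14: "SSP", 15: "PUA", 16: "PUA"}

def plane_of(code_point):
    """Return the plane number (0..16) of a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    return code_point >> 16  # each plane spans 0x10000 code points

for cp in (0x20AC, 0x1F000, 0x24B62, 0x10FFFD):
    plane = plane_of(cp)
    print(f"U+{cp:04X} -> plane {plane} ({PLANE_NAMES.get(plane, 'unassigned')})")
```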

Currently, about ten percent of the potential space is used. Furthermore, ranges of characters have been tentatively blocked out for every current and ancient writing system (script) the Unicode Consortium has been able to identify (see [1]). While Unicode may eventually need to use another of the 11 spare planes for ideographic characters, other planes would remain even if previously unknown scripts with tens of thousands of characters were discovered. The 21-bit limit is therefore unlikely to be reached in the near future.

The first plane (plane 0), the Basic Multilingual Plane (BMP),[Chandler1] is where most characters have been assigned so far. The BMP contains characters for almost all modern languages, and a large number of special characters. Most of the allocated code points in the BMP are used to encode Chinese, Japanese, and Korean (CJK) characters.


Unicode characters (in the original chart each block was color-coded by the Unicode version, 1.0 through 5.2, that introduced it, with separate colors for reserved ranges and noncharacters; the colors are not reproduced here)
BMP	BMP	SMP	SIP	SIP	SSP
0000–0FFF	8000–8FFF	10000–10FFF	20000–20FFF	28000–28FFF	E0000–E0FFF
1000–1FFF	9000–9FFF	11000–11FFF	21000–21FFF	29000–29FFF
2000–2FFF	A000–AFFF	12000–12FFF	22000–22FFF	2A000–2AFFF
3000–3FFF	B000–BFFF	13000–13FFF	23000–23FFF	2B000–2BFFF
4000–4FFF	C000–CFFF		24000–24FFF
5000–5FFF	D000–DFFF	1D000–1DFFF	25000–25FFF
6000–6FFF	E000–EFFF		26000–26FFF
7000–7FFF	F000–FFFF	1F000–1FFFF	27000–27FFF	2F000–2FFFF

Note: Unicode characters visualization will depend on the character support of your web browser and the fonts installed on your system.

U+	0	1	2	3	4	5	6	7	8	9	A	B	C	D	E	F
6000	怀	态	怂	怃	怄	怅	怆	怇	怈	怉	怊	怋	怌	怍	怎	怏
6010	怐	怑	怒	怓	怔	怕	怖	怗	怘	怙	怚	怛	怜	思	怞	怟
6020	怠	怡	怢	怣	怤	急	怦	性	怨	怩	怪	怫	怬	怭	怮	怯
6030	怰	怱	怲	怳	怴	怵	怶	怷	怸	怹	怺	总	怼	怽	怾	怿
6040	恀	恁	恂	恃	恄	恅	恆	恇	恈	恉	恊	恋	恌	恍	恎	恏
6050	恐	恑	恒	恓	恔	恕	恖	恗	恘	恙	恚	恛	恜	恝	恞	恟
6060	恠	恡	恢	恣	恤	恥	恦	恧	恨	恩	恪	恫	恬	恭	恮	息
6070	恰	恱	恲	恳	恴	恵	恶	恷	恸	恹	恺	恻	恼	恽	恾	恿
6080	悀	悁	悂	悃	悄	悅	悆	悇	悈	悉	悊	悋	悌	悍	悎	悏
6090	悐	悑	悒	悓	悔	悕	悖	悗	悘	悙	悚	悛	悜	悝	悞	悟
60A0	悠	悡	悢	患	悤	悥	悦	悧	您	悩	悪	悫	悬	悭	悮	悯
60B0	悰	悱	悲	悳	悴	悵	悶	悷	悸	悹	悺	悻	悼	悽	悾	悿
60C0	惀	惁	惂	惃	惄	情	惆	惇	惈	惉	惊	惋	惌	惍	惎	惏
60D0	惐	惑	惒	惓	惔	惕	惖	惗	惘	惙	惚	惛	惜	惝	惞	惟
60E0	惠	惡	惢	惣	惤	惥	惦	惧	惨	惩	惪	惫	惬	惭	惮	惯
60F0	惰	惱	惲	想	惴	惵	惶	惷	惸	惹	惺	惻	惼	惽	惾	惿
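A chart like the U+6000 to U+60FF listing above can be regenerated from the code points alone, since each cell is just `chr()` of the row label plus the column offset. The sketch below is illustrative; `chart_row` and `chart` are hypothetical helper names.

```python
def chart_row(start):
    """Return one tab-separated row: the row label, then 16 characters."""
    if start % 16 != 0:
        raise ValueError("row label must be a multiple of 16")
    cells = [chr(start + offset) for offset in range(16)]
    return "\t".join([f"{start:04X}"] + cells)

def chart(first=0x6000, last=0x60FF):
    """Return all rows covering the code points first..last."""
    return [chart_row(start) for start in range(first, last + 1, 16)]

for row in chart():
    print(row)
```

Whether the characters display correctly still depends on the fonts available, as the note above the chart says.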

GB18030-2000

Introduction

GB18030-2000 is a new character set standard from the PRC that specifies an extended codepage and a mapping table to Unicode.

On March 17, 2000, the Chinese government issued regulations mandating that all operating systems on non-handheld computers sold in the PRC after January 1, 2001 would have to comply with the new multibyte GB18030-2000 standard. However, the initial implementation deadline of January 1, 2001 was later postponed until September 1, 2001.

Evolution of GB18030-2000

All character set standards that originate in the PRC have designations that begin with "GB". GB is an abbreviation for Guojia Biaozhun, meaning "national standard". The GB 2312-1980 character set standard was established in 1981 to represent simplified Chinese characters. GB 2312-1980 is a coded character set that contains 7,445 characters, including 6,763 Hanzi and 682 non-Hanzi characters. With the release of ISO 10646-1/Unicode 2.1 in 1993, the PRC expressed its fundamental consent to support the combined efforts of the ISO/IEC and the Unicode Consortium through publishing a Chinese National Standard that was code- and character-compatible with ISO 10646-1/Unicode 2.1. This standard was named GB 13000.1. Whenever the ISO and the Unicode Consortium changed or revised their common standard, GB 13000.1 subsequently adopted these changes.

To accommodate all additional Hanzi characters specified in GB 13000.1 that are not included in GB 2312-1980, a new specification known as GBK was then introduced. GBK is an abbreviation for "Guojia Biaozhun Kuozhan" ("national standard, extended"); the full title of the specification translates as "Rules/Specifications defining the extensions of internal codes for Chinese ideograms". GBK is an extension of GB 2312-1980, and its key property is that it leaves the characters and codes defined in GB 2312-1980 untouched and positions all additional characters around them. The additional characters are mainly those of the Unified Han portion of Unicode 2.1 that go beyond the character repertoire of GB 2312-1980. Thus, code and character compatibility between GBK and GB 2312-1980 is ensured while, at the same time, the complete Unicode Unified Han character set is made available. When GBK was defined, some characters were also added that were not yet available in Unicode.

GBK defines 23,940 code points containing 21,886 characters. At the same time, GBK provides mappings to the code points of Unicode 2.1. However, due to the packed code space used to define GBK, it became obvious that there was no space left for a major addition. The 1,894 code points of GBK's three user-defined areas were not even close to providing sufficient space for the CJK Unified Ideographs Extension A, which defines 6,582 new characters in plane 0 of Unicode, version 3.0, the Basic Multilingual Plane (BMP).

Therefore, GB18030-2000 was created as an update of GBK for Unicode 3.0 with an extension that covers all of Unicode. It is fully backward-compatible with GB 2312-1980 and GBK. The mapping table from GB18030-2000 to Unicode is backward-compatible with the mapping table from GB 2312-1980 to Unicode; the GBK-to-Unicode table, however, has a few differences. GBK contains characters which were not defined in Unicode 2.1 but were added in later versions of Unicode.

GB18030-2000 specifies a mapping table that covers all Unicode code points and maintains compatibility of GB-encoded text with GBK and GB 2312-1980.
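The compatibility chain GB 2312, GBK, GB 18030 can be spot-checked with Python's built-in codecs. This is a sketch under the assumption that Python's `gb2312`, `gbk`, and `gb18030` codecs are faithful enough to the standards for these particular characters; `can_encode` is a hypothetical helper.

```python
def can_encode(text, codec):
    """Return True if `codec` has a code for every character in `text`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

zhong = "\u4e2d"            # in GB 2312, so all three codecs agree
gbk_only = "\u4e02"         # added by GBK, absent from GB 2312
ext_a = "\u3400"            # CJK Unified Ideographs Extension A
supplementary = "\U00024B62"  # outside the BMP

same_bytes = {codec: zhong.encode(codec) for codec in ("gb2312", "gbk", "gb18030")}

print(same_bytes)
print(can_encode(gbk_only, "gb2312"), gbk_only.encode("gbk") == gbk_only.encode("gb18030"))
print(can_encode(ext_a, "gbk"), len(ext_a.encode("gb18030")))
print(len(supplementary.encode("gb18030")))
```

Characters that GBK has no room for are reached in GB 18030 through four-byte sequences, which is how the standard manages to cover all of Unicode.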

GBK Encoding Ranges
The "code points" column gives how many byte combinations are available in each range; the four columns after it give how many of those code points are actually used for characters in each standard.
range	byte 1	byte 2	code points	GB 18030	GBK 1.0	Codepage 936	GB 2312
Level GBK/1	A1–A9	A1–FE	846	728	717	702	682
Level GBK/2	B0–F7	A1–FE	6,768	6,763	6,763	6,763	6,763
Level GBK/3	81–A0	40–FE except 7F	6,080	6,080	6,080	6,080
Level GBK/4	AA–FE	40–A0 except 7F	8,160	8,160	8,160	8,080
Level GBK/5	A8–A9	40–A0 except 7F	192	166	166	166
user-defined	AA–AF	A1–FE	564
user-defined	F8–FE	A1–FE	658
user-defined	A1–A7	40–A0 except 7F	672
total:			23,940	21,897	21,886	21,791	7,445
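The "code points" figures in the table follow directly from the byte ranges: the number of lead bytes times the number of trail bytes, skipping 0x7F. The sketch below recomputes them; `GBK_RANGES` and `count_code_points` are illustrative names, and the ranges are copied from the table.

```python
# (label, first lead byte, last lead byte, first trail byte, last trail byte)
GBK_RANGES = [
    ("Level GBK/1", 0xA1, 0xA9, 0xA1, 0xFE),
    ("Level GBK/2", 0xB0, 0xF7, 0xA1, 0xFE),
    ("Level GBK/3", 0x81, 0xA0, 0x40, 0xFE),
    ("Level GBK/4", 0xAA, 0xFE, 0x40, 0xA0),
    ("Level GBK/5", 0xA8, 0xA9, 0x40, 0xA0),
    ("user-defined", 0xAA, 0xAF, 0xA1, 0xFE),
    ("user-defined", 0xF8, 0xFE, 0xA1, 0xFE),
    ("user-defined", 0xA1, 0xA7, 0x40, 0xA0),
]

def count_code_points(lead_lo, lead_hi, trail_lo, trail_hi):
    """Lead bytes times trail bytes; 0x7F is never a valid trail byte."""
    leads = lead_hi - lead_lo + 1
    trails = sum(1 for b in range(trail_lo, trail_hi + 1) if b != 0x7F)
    return leads * trails

counts = [(label, count_code_points(*bounds)) for label, *bounds in GBK_RANGES]
total = sum(n for _, n in counts)
user_defined = sum(n for label, n in counts if label == "user-defined")

for label, n in counts:
    print(f"{label}\t{n:,}")
print(f"total\t{total:,}\tuser-defined\t{user_defined:,}")
```

The user-defined subtotal matches the 1,894 code points mentioned earlier for GBK's three user-defined areas.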

In graphical form, the following figure shows the space of all 64K possible 2-byte codes. Green and yellow areas are assigned GBK codepoints, red are for user-defined characters. The uncolored areas are invalid byte combinations.

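The table and figure above show very clearly the ranges of two-byte values covered by GB 18030, GBK 1.0, Microsoft Codepage 936, and GB 2312. The "Mapping Tables for Character Sets" section below shows the same thing: first the 128 single-byte GBK values, then the two-byte values, taking lead byte 0x81 as the example. Note that the second byte starts at 0x40, so there are no valid two-byte values from 0x8100 to 0x813F; those positions are empty. This corresponds to the statement above that "The uncolored areas are invalid byte combinations."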
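The gap below trail byte 0x40 can be checked with a decoder. The sketch below assumes Python's built-in `gbk` codec is an acceptable stand-in for the GBK tables discussed here; `is_valid_gbk` is a hypothetical helper.

```python
def is_valid_gbk(raw):
    """Return True if `raw` decodes cleanly as GBK."""
    try:
        raw.decode("gbk")
        return True
    except UnicodeDecodeError:
        return False

first_valid = bytes([0x81, 0x40])   # first two-byte code with lead byte 0x81
below_range = bytes([0x81, 0x39])   # trail byte below 0x40

print(is_valid_gbk(first_valid), first_valid.decode("gbk"))
print(is_valid_gbk(below_range))
```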

Mapping Tables for Character Sets - GB2312

How to read this chart:

symbol
UTF-8 (hex)
UTF-16 (hex)

main

NUL
00
0000

SOH
01
0001

STX
02
0002

ETX
03
0003

EOT
04
0004

ENQ
05
0005

ACK
06
0006

BEL
07
0007

BS
08
0008

HT
09
0009

LF
0A
000A

VT
0B
000B

FF
0C
000C

CR
0D
000D

SO
0E
000E

SI
0F
000F

DLE
10
0010

DC1
11
0011

DC2
12
0012

DC3
13
0013

DC4
14
0014

NAK
15
0015

SYN
16
0016

ETB
17
0017

CAN
18
0018

EM
19
0019

SUB
1A
001A

ESC
1B
001B

FS
1C
001C

GS
1D
001D

RS
1E
001E

US
1F
001F

SP
20
0020

!
21
0021

"
22
0022

#
23
0023

$
24
0024

%
25
0025

&
26
0026

'
27
0027

(
28
0028

)
29
0029

*
2A
002A

+
2B
002B

,
2C
002C

-
2D
002D

.
2E
002E

/
2F
002F

0
30
0030

1
31
0031

2
32
0032

3
33
0033

4
34
0034

5
35
0035

6
36
0036

7
37
0037

8
38
0038

9
39
0039

:
3A
003A

;
3B
003B

<
3C
003C

=
3D
003D

>
3E
003E

?
3F
003F

@
40
0040

A
41
0041

B
42
0042

C
43
0043

D
44
0044

E
45
0045

F
46
0046

G
47
0047

H
48
0048

I
49
0049

J
4A
004A

K
4B
004B

L
4C
004C

M
4D
004D

N
4E
004E

O
4F
004F

P
50
0050

Q
51
0051

R
52
0052

S
53
0053

T
54
0054

U
55
0055

V
56
0056

W
57
0057

X
58
0058

Y
59
0059

Z
5A
005A

[
5B
005B

\
5C
005C

]
5D
005D

^
5E
005E

_
5F
005F

`
60
0060

a
61
0061

b
62
0062

c
63
0063

d
64
0064

e
65
0065

f
66
0066

g
67
0067

h
68
0068

i
69
0069

j
6A
006A

k
6B
006B

l
6C
006C

m
6D
006D

n
6E
006E

o
6F
006F

p
70
0070

q
71
0071

r
72
0072

s
73
0073

t
74
0074

u
75
0075

v
76
0076

w
77
0077

x
78
0078

y
79
0079

z
7A
007A

{
7B
007B

|
7C
007C

}
7D
007D

~
7E
007E

DEL
7F
007F

€
E282AC
20AC

main | 81

丂
E4B882
4E02

丄
E4B884
4E04

丅
E4B885
4E05

丆
E4B886
4E06

丏
E4B88F
4E0F

丒
E4B892
4E12

丗
E4B897
4E17

丟
E4B89F
4E1F

丠
E4B8A0
4E20

両
E4B8A1
4E21

丣
E4B8A3
4E23

並
E4B8A6
4E26

丩
E4B8A9
4E29

丮
E4B8AE
4E2E

丯
E4B8AF
4E2F

丱
E4B8B1
4E31

丳
E4B8B3
4E33

丵
E4B8B5
4E35

丷
E4B8B7
4E37

丼
E4B8BC
4E3C

乀
E4B980
4E40

乁
E4B981
4E41

乂
E4B982
4E42

乄
E4B984
4E44

乆
E4B986
4E46

乊
E4B98A
4E4A

乑
E4B991
4E51

乕
E4B995
4E55

乗
E4B997
4E57

乚
E4B99A
4E5A

乛
E4B99B
4E5B

乢
E4B9A2
4E62

乣
E4B9A3
4E63

乤
E4B9A4
4E64

乥
E4B9A5
4E65

乧
E4B9A7
4E67

乨
E4B9A8
4E68

乪
E4B9AA
4E6A

乫
E4B9AB
4E6B

乬
E4B9AC
4E6C

乭
E4B9AD
4E6D

乮
E4B9AE
4E6E

乯
E4B9AF
4E6F

乲
E4B9B2
4E72

乴
E4B9B4
4E74

乵
E4B9B5
4E75

乶
E4B9B6
4E76

乷
E4B9B7
4E77

乸
E4B9B8
4E78

乹
E4B9B9
4E79

乺
E4B9BA
4E7A

乻
E4B9BB
4E7B

乼
E4B9BC
4E7C

乽
E4B9BD
4E7D

乿
E4B9BF
4E7F

亀
E4BA80
4E80

亁
E4BA81
4E81

亂
E4BA82
4E82

亃
E4BA83
4E83

亄
E4BA84
4E84

亅
E4BA85
4E85

亇
E4BA87
4E87

亊
E4BA8A
4E8A

亐
E4BA90
4E90

亖
E4BA96
4E96

亗
E4BA97
4E97

亙
E4BA99
4E99

亜
E4BA9C
4E9C

亝
E4BA9D
4E9D

亞
E4BA9E
4E9E

亣
E4BAA3
4EA3

亪
E4BAAA
4EAA

亯
E4BAAF
4EAF

亰
E4BAB0
4EB0

亱
E4BAB1
4EB1

亴
E4BAB4
4EB4

亶
E4BAB6
4EB6

亷
E4BAB7
4EB7

亸
E4BAB8
4EB8

亹
E4BAB9
4EB9

亼
E4BABC
4EBC

亽
E4BABD
4EBD

亾
E4BABE
4EBE

仈
E4BB88
4EC8

仌
E4BB8C
4ECC

仏
E4BB8F
4ECF

仐
E4BB90
4ED0

仒
E4BB92
4ED2

仚
E4BB9A
4EDA

仛
E4BB9B
4EDB

仜
E4BB9C
4EDC

仠
E4BBA0
4EE0

仢
E4BBA2
4EE2

仦
E4BBA6
4EE6

仧
E4BBA7
4EE7

仩
E4BBA9
4EE9

仭
E4BBAD
4EED

仮
E4BBAE
4EEE

仯
E4BBAF
4EEF

仱
E4BBB1
4EF1

仴
E4BBB4
4EF4

仸
E4BBB8
4EF8

仹
E4BBB9
4EF9

仺
E4BBBA
4EFA

仼
E4BBBC
4EFC

仾
E4BBBE
4EFE

伀
E4BC80
4F00

伂
E4BC82
4F02

伃
E4BC83
4F03

伄
E4BC84
4F04

伅
E4BC85
4F05

伆
E4BC86
4F06

伇
E4BC87
4F07

伈
E4BC88
4F08

伋
E4BC8B
4F0B

伌
E4BC8C
4F0C

伒
E4BC92
4F12

伓
E4BC93
4F13

伔
E4BC94
4F14

伕
E4BC95
4F15

伖
E4BC96
4F16

伜
E4BC9C
4F1C

伝
E4BC9D
4F1D

伡
E4BCA1
4F21

伣
E4BCA3
4F23

伨
E4BCA8
4F28

伩
E4BCA9
4F29

伬
E4BCAC
4F2C

伭
E4BCAD
4F2D

伮
E4BCAE
4F2E

伱
E4BCB1
4F31

伳
E4BCB3
4F33

伵
E4BCB5
4F35

伷
E4BCB7
4F37

伹
E4BCB9
4F39

伻
E4BCBB
4F3B

伾
E4BCBE
4F3E

伿
E4BCBF
4F3F

佀
E4BD80
4F40

佁
E4BD81
4F41

佂
E4BD82
4F42

佄
E4BD84
4F44

佅
E4BD85
4F45

佇
E4BD87
4F47

佈
E4BD88
4F48

佉
E4BD89
4F49

佊
E4BD8A
4F4A

佋
E4BD8B
4F4B

佌
E4BD8C
4F4C

佒
E4BD92
4F52

佔
E4BD94
4F54

佖
E4BD96
4F56

佡
E4BDA1
4F61

佢
E4BDA2
4F62

佦
E4BDA6
4F66

佨
E4BDA8
4F68

佪
E4BDAA
4F6A

佫
E4BDAB
4F6B

佭
E4BDAD
4F6D

佮
E4BDAE
4F6E

佱
E4BDB1
4F71

佲
E4BDB2
4F72

併
E4BDB5
4F75

佷
E4BDB7
4F77

佸
E4BDB8
4F78

佹
E4BDB9
4F79

佺
E4BDBA
4F7A

佽
E4BDBD
4F7D

侀
E4BE80
4F80

侁
E4BE81
4F81

侂
E4BE82
4F82

侅
E4BE85
4F85

來
E4BE86
4F86

侇
E4BE87
4F87

侊
E4BE8A
4F8A

侌
E4BE8C
4F8C

侎
E4BE8E
4F8E

侐
E4BE90
4F90

侒
E4BE92
4F92

侓
E4BE93
4F93

侕
E4BE95
4F95

侖
E4BE96
4F96

侘
E4BE98
4F98

侙
E4BE99
4F99

侚
E4BE9A
4F9A

侜
E4BE9C
4F9C

侞
E4BE9E
4F9E

侟
E4BE9F
4F9F

価
E4BEA1
4FA1

侢
E4BEA2
4FA2
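Each entry in the chart above (symbol, UTF-8 hex, UTF-16 hex) can be derived from the GBK bytes of the character. The following sketch does that with Python's built-in codecs; `chart_entry` is an illustrative name, and the `gbk` codec is assumed to match the chart for the bytes shown.

```python
def chart_entry(gbk_bytes):
    """Return (symbol, UTF-8 hex, UTF-16 hex) for one GBK-encoded character."""
    symbol = gbk_bytes.decode("gbk")
    if len(symbol) != 1:
        raise ValueError("expected exactly one character")
    utf8_hex = symbol.encode("utf-8").hex().upper()
    utf16_hex = symbol.encode("utf-16-be").hex().upper()
    return symbol, utf8_hex, utf16_hex

# Single-byte part of the chart, then the first cells of the "81" page.
for raw in (b"A", bytes([0x81, 0x40]), bytes([0x81, 0x41])):
    print(raw.hex().upper(), *chart_entry(raw))
```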

UTF-8

UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It is able to represent any character in the Unicode standard.

Unicode	Byte1	Byte2	Byte3	Byte4	example
U+0000–U+007F	0xxxxxxx				'$' U+0024 → 00100100 → 0x24
U+0080–U+07FF	110yyyxx	10xxxxxx			'¢' U+00A2 → 11000010,10100010 → 0xC2,0xA2
U+0800–U+FFFF	1110yyyy	10yyyyxx	10xxxxxx		'€' U+20AC → 11100010,10000010,10101100 → 0xE2,0x82,0xAC
U+10000–U+10FFFF	11110zzz	10zzyyyy	10yyyyxx	10xxxxxx	'𤭢' U+024B62 → 11110000,10100100,10101101,10100010 → 0xF0,0xA4,0xAD,0xA2
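The bit layout in the table can be turned into a small encoder. The sketch below is a hand-written illustration of the four rows, not a replacement for `str.encode`; `utf8_encode` is a hypothetical name, and the result is checked against Python's own codec in the usage lines.

```python
def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8, following the table's rows."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"cannot encode {cp:#x}")
    if cp <= 0x7F:                       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 110yyyxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 1110yyyy 10yyyyxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# The four examples from the table, compared with the built-in codec.
for cp in (0x24, 0xA2, 0x20AC, 0x24B62):
    encoded = utf8_encode(cp)
    print(f"U+{cp:04X}", encoded.hex().upper(), encoded == chr(cp).encode("utf-8"))
```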

UTF-16/UCS-2

In computing, UTF-16 (16-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode, capable of encoding the entire Unicode repertoire.

UCS-2 (2-byte Universal Character Set): the UCS-2 encoding form is identical to that of UTF-16, except that it does not support surrogate pairs and therefore can only encode characters in the BMP range U+0000 through U+FFFF.

Encoding of characters outside the BMP

The improvement that UTF-16 made over UCS-2 is its ability to encode characters in planes 1–16, not just those in plane 0 (BMP). This was done by taking an unassigned portion of the 16-bit UCS-2 space and setting it aside for surrogates.

[Figure not reproduced: a color bar showing the 16-bit UCS-2 space to scale.]

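Note: in the BMP map referred to above (Mapping of Unicode character planes), the rows from 0xD8 to 0xF8 are all labeled "UTF-16 surrogates and private use". The first eight of those rows (0xD8 to 0xDF, i.e. U+D800–U+DFFF) are what UTF-16 uses to represent Unicode code points outside the BMP.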

	DC00	DC01	…	DFFF
D800	010000	010001	…	0103FF
D801	010400	010401	…	0107FF
⋮			⋱	⋮
DBFF	10FC00	10FC01	…	10FFFF

UTF-16 represents non-BMP characters (those from U+10000 through U+10FFFF) using a pair of 16-bit words, known as a surrogate pair. First 10000₁₆ is subtracted from the code point to give a 20-bit value. This is then split into two separate 10-bit values, each of which is represented as a surrogate, with the most significant half placed in the first surrogate. To allow safe use of simple word-oriented string processing, separate ranges of values are used for the two surrogates: 0xD800–0xDBFF for the first, most significant surrogate (the row labels in the table above) and 0xDC00–0xDFFF for the second, least significant surrogate (the column labels).

For example, the character at code point U+10000 becomes the code unit sequence 0xD800 0xDC00, and the character at U+10FFFD, the highest code point available for characters (U+10FFFE and U+10FFFF are noncharacters), becomes the sequence 0xDBFF 0xDFFD.[Chandler2] Unicode and ISO/IEC 10646 do not, and will never, assign characters to any of the code points in the U+D800–U+DFFF range, so an individual code value from a surrogate pair never represents a character.
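The subtract-and-split procedure can be written out in a few lines. The sketch below is illustrative only; `to_surrogates` and `from_surrogates` are hypothetical names, and the usage lines compare the result with Python's `utf-16-be` codec.

```python
def to_surrogates(cp):
    """Split a non-BMP code point into (first, second) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    value = cp - 0x10000                 # 20-bit value
    first = 0xD800 + (value >> 10)       # most significant 10 bits
    second = 0xDC00 + (value & 0x3FF)    # least significant 10 bits
    return first, second

def from_surrogates(first, second):
    """Recombine a surrogate pair into the original code point."""
    if not (0xD800 <= first <= 0xDBFF and 0xDC00 <= second <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00)

for cp in (0x10000, 0x24B62, 0x10FFFD):
    first, second = to_surrogates(cp)
    codec_bytes = chr(cp).encode("utf-16-be")
    matches = codec_bytes == first.to_bytes(2, "big") + second.to_bytes(2, "big")
    print(f"U+{cp:04X} -> {first:04X} {second:04X}", matches)
```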

[Chandler1]

The following URL provides a lookup for the Unicode BMP:

http://www.atm.ox.ac.uk/user/iwi/charmap.html

[Chandler2] Subtracting 0x10000 from U+10FFFD gives 0xFFFFD. Split into two 10-bit halves, the first is 0x3FF and the second is 0x3FD, so:

1st surrogate: 0xD800 + 0x3FF = 0xDBFF

2nd surrogate: 0xDC00 + 0x3FD = 0xDFFD

posted on 2014-08-01 13:48 Newbie wang