Appendix A Unicode
(Unicode /ˈjuːnɪˌkəʊd/ n. a character set for all languages.)
Computers use numbers. They store characters by assigning a number to each one. The original coding system was called ASCII (American Standard Code for Information Interchange) and had 128 numbers (0 to 127), each stored as a 7-bit number. ASCII could satisfactorily handle lowercase and uppercase letters, digits, punctuation characters, and some control characters. An attempt was made to extend the ASCII character set to 8 bits. The new code, which was called Extended ASCII, was never internationally standardized.
To overcome the difficulties inherent in ASCII and Extended ASCII, the Unicode Consortium (a group of multilingual software manufacturers) created a universal encoding system to provide a comprehensive character set called Unicode.
Unicode was originally a 2-byte character set. Unicode version 3, however, is a 4-byte code and is fully compatible with ASCII and Extended ASCII. The ASCII set, which is now called Basic Latin, is Unicode with the upper 25 bits set to zero. Extended ASCII, which is now called Latin-1, is Unicode with the upper 24 bits set to zero. Figure A.1 shows how the different systems are compatible.
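The compatibility rule can be checked with a few lines of Python. This is only a sketch: the helper name fits_in_bits is made up for illustration, and it relies on Python's built-in ord, which returns the Unicode code of a character. A character belongs to Basic Latin if nothing remains after shifting away its low 7 bits, and to Latin-1 if nothing remains after shifting away its low 8 bits.

```python
def fits_in_bits(ch: str, bits: int) -> bool:
    """Return True if every bit above the low `bits` bits of ch's code is zero."""
    return ord(ch) >> bits == 0


# 'A' is in Basic Latin: only the low 7 bits are used.
print(fits_in_bits("A", 7))        # True
# U+00E9 (e with acute accent) is in Latin-1: it needs 8 bits, not 7.
print(fits_in_bits("\u00e9", 7))   # False
print(fits_in_bits("\u00e9", 8))   # True
# U+4E2D (a Chinese character) lies outside Latin-1.
print(fits_in_bits("\u4e2d", 8))   # False
```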
( assign /ə'saɪn/ vt. If you assign a particular function or value to someone or something, you say they have it;
satisfactorily /,sætɪs'fæktərəlɪ/ adv. in a satisfactory way;
punctuation /pʌŋ(k)tʃʊ'eɪʃ(ə)n/ n. the marks used to divide a piece of writing into sentences, phrases, etc.;
inherent /ɪn'hɪər(ə)nt/ adj. a quality that is inherent in something is a natural part of it and cannot be separated from it;
consortium /kən'sɔːtɪəm/ n. a group of people or firms who have agreed to cooperate with each other;
multilingual /mʌltɪ'lɪŋgw(ə)l/ adj. involving several different languages;
manufacturer /,mænjʊ'fæktʃ(ə)rə(r)/ n. a company that makes large quantities of goods;
upper /ˈʌp.ər/ adj. at a higher position or level than something else, or being the top part of something;
Latin /ˈlæt.ɪn/ n. the language used by the ancient Romans and, in the past, by educated people in many European countries;
compatible /kəmˈpætəb(ə)l/ adj. able to exist, live together, or work successfully with something or someone else;
compatibility /kəmˌpætəˈbɪləti/ n. the ability of machines, especially computers, or computer programs to work successfully with other machines or programs.
)
Each character or symbol in this code is defined by a 32-bit number. The code can define up to 2^32 (4,294,967,296) characters or symbols. The notation uses hexadecimal digits in the following format:
U-XXXXXXXX
Each X is a hexadecimal digit. Therefore, the numbering goes from U-00000000 to U-FFFFFFFF.
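As a small illustration, the U-XXXXXXXX notation can be produced from an integer code as follows. The function name to_u_notation is hypothetical and the range check is an assumption added for safety; the formatting itself is just eight uppercase hexadecimal digits with leading zeros.

```python
def to_u_notation(code: int) -> str:
    """Format a 32-bit code in the U-XXXXXXXX notation."""
    if not 0 <= code <= 0xFFFFFFFF:
        raise ValueError("code must fit in 32 bits")
    # 08X: pad with zeros to eight uppercase hexadecimal digits.
    return f"U-{code:08X}"


print(to_u_notation(0))           # U-00000000
print(to_u_notation(0x41))        # U-00000041
print(to_u_notation(0xFFFFFFFF))  # U-FFFFFFFF
```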
A.1 PLANES
Unicode divides the available code space into planes. The most significant 16 bits define the plane, which means we can have 65,536 planes. Each plane can define up to 65,536 characters or symbols. Figure A.2 shows the structure of Unicode spaces and planes.
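The split into a plane number and a position inside the plane is plain bit arithmetic. The sketch below assumes the 32-bit layout described above; the name split_code is made up for illustration.

```python
def split_code(code: int) -> tuple:
    """Split a 32-bit code into (plane, offset within the plane)."""
    plane = code >> 16       # most significant 16 bits
    offset = code & 0xFFFF   # least significant 16 bits
    return plane, offset


print(split_code(0x00000041))  # (0, 65)
print(split_code(0x0001F600))  # (1, 62976)
```

Both parts are 16 bits wide, which is where the two figures of 65,536 come from.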
(plane /pleɪn/ n. in mathematics, a flat or level surface that continues in all directions.
)
Basic Multilingual Plane (BMP)
Plane (0000)16, the basic multilingual plane (BMP), is designed to be compatible with the previous 16-bit Unicode. The most significant 16 bits in this plane are all zeros. The codes are normally shown as U+XXXX, with the understanding that XXXX defines only the least significant 16 bits. This plane mostly defines character sets in different languages, with the exception of some codes used for control or other characters.
Other planes
There are some other (non-reserved) planes that we briefly describe below:
Supplementary Multilingual Plane (SMP)
Plane (0001)16, the supplementary multilingual plane (SMP), is designed to provide more codes for those multilingual characters that are not included in the BMP.
Supplementary Ideographic Plane (SIP)
Plane (0002)16, the supplementary ideographic plane (SIP), is designed to provide codes for ideographic symbols, symbols that primarily denote an idea (or meaning) in contrast to a sound (or pronunciation).
Supplementary Special Plane (SSP)
Plane (000E)16 , the supplementary special plane (SSP), is used for special characters.
Private Use Planes (PUPs)
Planes (000F)16 and (0010)16, the private use planes (PUPs), are for private use.
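The planes listed in this section can be collected into a small lookup table. This is a sketch only: PLANE_NAMES and plane_name are hypothetical names, and the label "other" for planes not described here is an assumption of the example, not part of Unicode.

```python
# Plane number (upper 16 bits of the code) -> abbreviation used in this appendix.
PLANE_NAMES = {
    0x0000: "BMP",
    0x0001: "SMP",
    0x0002: "SIP",
    0x000E: "SSP",
    0x000F: "PUP",
    0x0010: "PUP",
}


def plane_name(code: int) -> str:
    """Return the abbreviation of the plane that contains the given code."""
    return PLANE_NAMES.get(code >> 16, "other")


print(plane_name(0x00004E2D))  # BMP
print(plane_name(0x00020000))  # SIP
print(plane_name(0x00100000))  # PUP
print(plane_name(0x00030000))  # other
```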
A.2 ASCII
The American Standard Code for Information Interchange (ASCII) is a 7-bit code that was designed to provide codes for 128 symbols, mostly in American English. Today, ASCII, or Basic Latin, is part of Unicode. It occupies the first 128 codes in Unicode (00000000 to 0000007F). Table A.1 contains the hexadecimal codes and the graphic codes (symbols). The hexadecimal codes in the table define only the two least significant digits of the Unicode code. To find the full eight-digit code, we prepend 000000 in hexadecimal to the code.
Table A.1
Hex | Symbol | Hex | Symbol | Hex | Symbol | Hex | Symbol |
00 | NULL | 20 | SP | 40 | @ | 60 | ` |
01 | SOH | 21 | ! | 41 | A | 61 | a |
02 | STX | 22 | " | 42 | B | 62 | b |
03 | ETX | 23 | # | 43 | C | 63 | c |
04 | EOT | 24 | $ | 44 | D | 64 | d |
05 | ENQ | 25 | % | 45 | E | 65 | e |
06 | ACK | 26 | & | 46 | F | 66 | f |
07 | BEL | 27 | ' | 47 | G | 67 | g |
08 | BS | 28 | ( | 48 | H | 68 | h |
09 | HT | 29 | ) | 49 | I | 69 | i |
0A | LF | 2A | * | 4A | J | 6A | j |
0B | VT | 2B | + | 4B | K | 6B | k |
0C | FF | 2C | , | 4C | L | 6C | l |
0D | CR | 2D | - | 4D | M | 6D | m |
0E | SO | 2E | . | 4E | N | 6E | n |
0F | SI | 2F | / | 4F | O | 6F | o |
10 | DLE | 30 | 0 | 50 | P | 70 | p |
11 | DC1 | 31 | 1 | 51 | Q | 71 | q |
12 | DC2 | 32 | 2 | 52 | R | 72 | r |
13 | DC3 | 33 | 3 | 53 | S | 73 | s |
14 | DC4 | 34 | 4 | 54 | T | 74 | t |
15 | NAK | 35 | 5 | 55 | U | 75 | u |
16 | SYN | 36 | 6 | 56 | V | 76 | v |
17 | ETB | 37 | 7 | 57 | W | 77 | w |
18 | CAN | 38 | 8 | 58 | X | 78 | x |
19 | EM | 39 | 9 | 59 | Y | 79 | y |
1A | SUB | 3A | : | 5A | Z | 7A | z |
1B | ESC | 3B | ; | 5B | [ | 7B | { |
1C | FS | 3C | < | 5C | \ | 7C | | |
1D | GS | 3D | = | 5D | ] | 7D | } |
1E | RS | 3E | > | 5E | ^ | 7E | ~ |
1F | US | 3F | ? | 5F | _ | 7F | DEL |
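The relation between a two-digit code in Table A.1 and the full eight-digit code can be sketched as below. Since a full code has eight hexadecimal digits, a two-digit table entry needs six leading zeros. The name full_code is made up for illustration; int, chr, and ord are Python built-ins.

```python
def full_code(two_digit_hex: str) -> str:
    """Expand a two-digit hex code from Table A.1 to eight hex digits."""
    if len(two_digit_hex) != 2:
        raise ValueError("expected exactly two hexadecimal digits")
    return "000000" + two_digit_hex.upper()


print(full_code("41"))                # 00000041
print(chr(int(full_code("41"), 16)))  # A
print(full_code("7f"))                # 0000007F
```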
Some properties of ASCII
ASCII has some interesting properties that we briefly mention here.
1. The space character, (20)16, is a printable character. It prints a blank space.
2. The uppercase letters start from (41)16. The lowercase letters start from (61)16. When compared, uppercase letters are numerically smaller than lowercase letters. This means that in a sorted list based on ASCII values, the uppercase letters appear before the lowercase letters.
3. The uppercase letters and lowercase letters differ by only one bit in the 7-bit code. For example, character A is (1000001)2 and character a is (1100001)2. The difference is in bit 6 (counting from the right, starting at 1), which is 0 in uppercase letters and 1 in lowercase letters. If we know the code for one case, we can easily find the code for the other by adding or subtracting (20)16, or we can just flip the sixth bit.
4. The uppercase letters are not immediately followed by lowercase letters. There are some punctuation characters in between.
5. Digits (0 to 9) start from (30)16. This means that if you want to change a numeric character to its face value as an integer, you need to subtract (30)16 = 48 from it.
6. The first 32 characters, (00)16 to (1F)16, and the last character, (7F)16, are non-printable characters. Character (00)16 is simply used as a delimiter to mark the end of a character string. Character (7F)16 is the delete character, used to delete the previous character. The rest of the non-printable characters are referred to as control characters and are used in data communication. Table A.2 gives the description of these characters.
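Properties 2, 3, and 5 can be tried out directly. The following is a minimal sketch; swap_case and digit_value are hypothetical helper names, and the input checks are assumptions added so that the bit trick is only applied where it is valid.

```python
def swap_case(ch: str) -> str:
    """Switch an ASCII letter between cases by flipping the bit worth (20)16."""
    if len(ch) != 1 or not (ch.isascii() and ch.isalpha()):
        raise ValueError("expected a single ASCII letter")
    return chr(ord(ch) ^ 0x20)


def digit_value(ch: str) -> int:
    """Convert a digit character to its face value by subtracting (30)16."""
    if len(ch) != 1 or not "0" <= ch <= "9":
        raise ValueError("expected a single digit character")
    return ord(ch) - 0x30


print(swap_case("A"))                # a
print(swap_case("a"))                # A
print(digit_value("7"))              # 7
# Sorting by code puts uppercase letters before lowercase letters.
print(sorted(["b", "A", "a", "B"]))  # ['A', 'B', 'a', 'b']
```

Flipping the bit with XOR works in both directions, so one function covers both the "add (20)16" and the "subtract (20)16" cases.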
(property /ˈprɒp.ə.ti/ n. The properties of a substance or object are the ways in which it behaves in particular conditions;
flip /flɪp/ v. If something flips over, or if you flip it over or into a different position, it moves or is moved into a different position.
)
Table A.2 ASCII Code
Symbol | Interpretation | Symbol | Interpretation |
SOH | Start of heading | DC1 | Device control 1 |
STX | Start of text | DC2 | Device control 2 |
ETX | End of text | DC3 | Device control 3 |
EOT | End of transmission | DC4 | Device control 4 |
ENQ | Enquiry | NAK | Negative acknowledgment |
ACK | Acknowledgment | SYN | Synchronous idle |
BEL | Ring bell | ETB | End of transmission block |
BS | Backspace | CAN | Cancel |
HT | Horizontal tab | EM | End of medium |
LF | Line feed | SUB | Substitute |
VT | Vertical tab | ESC | Escape |
FF | Form feed | FS | File separator |
CR | Carriage return | GS | Group separator |
SO | Shift out | RS | Record separator |
SI | Shift in | US | Unit separator |
DLE | Data link escape |