Appendix A Unicode


(Unicode /ˈjuːnɪˌkəʊd/  n.  a character set for all languages.)

 

Computers use numbers. They store characters by assigning a number to each one. The original coding system was called ASCII (American Standard Code for Information Interchange) and had 128 numbers (0 to 127), each stored as a 7-bit number. ASCII could satisfactorily handle lowercase and uppercase letters, digits, punctuation characters, and some control characters. An attempt was made to extend the ASCII character set to 8 bits. The new code, which was called Extended ASCII, was never internationally standardized.

  To overcome the difficulties inherent in ASCII and Extended ASCII, the Unicode Consortium (a group of multilingual software manufacturers) created a universal encoding system to provide a comprehensive character set called Unicode. 

  Unicode was originally a 2-byte character set. Unicode version 3, however, is a 4-byte code and is fully compatible with ASCII and Extended ASCII. The ASCII set, which is now called Basic Latin, is Unicode with the upper 25 bits set to zero. Extended ASCII, which is now called Latin-1, is Unicode with the upper 24 bits set to zero. Figure A.1 shows how the different systems are compatible.
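The compatibility claim can be checked with a few lines of Python. This is only a sketch: the built-in `ord` returns a character's Unicode number, and the two sample characters are illustrative choices, not taken from the text.

```python
# ASCII (Basic Latin): the Unicode number equals the 7-bit ASCII value.
a = "A"
assert ord(a) == a.encode("ascii")[0] == 0x41
assert ord(a) >> 7 == 0   # everything above the low 7 bits is zero

# Extended ASCII (Latin-1): the Unicode number equals the 8-bit Latin-1 value.
e = "\u00e9"              # LATIN SMALL LETTER E WITH ACUTE (sample character)
assert ord(e) == e.encode("latin-1")[0] == 0xE9
assert ord(e) >> 8 == 0   # everything above the low 8 bits is zero

# Show both values as 32-bit binary strings to make the leading zeros visible.
print(f"{ord(a):032b}")
print(f"{ord(e):032b}")
```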


  (assign /ə'saɪn/  vt.  If you assign a particular function or value to someone or something, you say they have it.

   satisfactorily /ˌsætɪs'fæktərəlɪ/  adv.  in a satisfactory manner.

   punctuation /pʌŋ(k)tʃʊ'eɪʃ(ə)n/  n.  the marks used to divide a piece of writing into sentences, phrases, etc.

   inherent /ɪn'hɪər(ə)nt/  adj.  A quality that is inherent in something is a natural part of it and cannot be separated from it.

   consortium /kən'sɔːtɪəm/  n.  A consortium is a group of people or firms who have agreed to cooperate with each other.

   multilingual /mʌltɪ'lɪŋgw(ə)l/  adj.  Multilingual means involving several different languages.

   manufacturer /ˌmænjʊ'fæktʃ(ə)rə(r)/  n.  a company that makes large quantities of goods.

   upper /ˈʌp.ər/  adj.  at a higher position or level than something else, or being the top part of something.

   Latin /ˈlæt.ɪn/  n.  the language used by the ancient Romans and, in the past, by educated people in many European countries.

   compatible /kəmˈpætəbl/  adj.  able to exist, live together, or work successfully with something or someone else.

   compatibility /kəmˌpætəˈbɪləti/  n.  the ability of machines, especially computers, or computer programs to work successfully with other machines or programs.)

 

  Each character or symbol in this code is defined by a 32-bit number. The code can define up to 2^32 (4,294,967,296) characters or symbols. The notation uses hexadecimal digits in the following format:

U-XXXXXXXX

  Each X is a hexadecimal digit. Therefore, the numbering goes from U-00000000 to U-FFFFFFFF.
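As a small illustration of this notation, the following Python sketch formats a character's number as eight hexadecimal digits. The helper name `u_notation` is hypothetical, chosen only for this example.

```python
def u_notation(ch: str) -> str:
    """Return the 32-bit code of a single character in U-XXXXXXXX form."""
    # ord() gives the character's number; 08X pads it to 8 uppercase hex digits.
    return f"U-{ord(ch):08X}"


print(u_notation("A"))       # U-00000041
print(u_notation("\u20ac"))  # U-000020AC (the euro sign)
```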

 

A.1 PLANES

 

 

Unicode divides the available code space into planes. The most significant 16 bits define the plane, which means we can have 65,536 planes. Each plane can define up to 65,536 characters or symbols. Figure A.2 shows the structure of Unicode code spaces and planes.
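The split between plane and position can be sketched with two bit operations. The function name `split_code` is an assumption made for this example; the logic simply follows the description above (upper 16 bits for the plane, lower 16 bits for the position inside it).

```python
def split_code(code: int) -> tuple[int, int]:
    """Split a 32-bit code into (plane, position within the plane)."""
    plane = code >> 16         # most significant 16 bits
    position = code & 0xFFFF   # least significant 16 bits
    return plane, position


print(split_code(0x00000041))  # (0, 65)
print(split_code(0x0001F600))  # (1, 62976)
```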


 

(plane /pleɪn/  n.  in mathematics, a flat or level surface that continues in all directions.)

 

Basic Multilingual Plane (BMP)


Plane (0000)₁₆, the basic multilingual plane (BMP), is designed to be compatible with the previous 16-bit Unicode. The most significant 16 bits in this plane are all zeros. The codes are normally shown as U+XXXX, with the understanding that XXXX defines only the least significant 16 bits. This plane mostly defines character sets in different languages, with the exception of some codes used for control or other special characters.

 


 

Other planes 

There are some other (non-reserved) planes that we briefly describe below:

Supplementary Multilingual Plane (SMP)

Plane (0001)₁₆, the supplementary multilingual plane (SMP), is designed to provide more codes for those multilingual characters that are not included in the BMP.

Supplementary Ideographic Plane (SIP)

Plane (0002)₁₆, the supplementary ideographic plane (SIP), is designed to provide codes for ideographic symbols, that is, symbols that primarily denote an idea (or meaning) in contrast to a sound (or pronunciation).

Supplementary Special Plane (SSP)

Plane (000E)₁₆, the supplementary special plane (SSP), is used for special characters.

Private Use Planes (PUPs)

Planes (000F)₁₆ and (0010)₁₆, the private use planes (PUPs), are for private use.
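The planes described above can be gathered into a small lookup table. The Python sketch below is illustrative only: `PLANE_NAMES` and `plane_name` are hypothetical names, and the sample characters are arbitrary choices from planes 0, 1, and 2.

```python
# Plane number (upper 16 bits of the code) -> name used in this appendix.
PLANE_NAMES = {
    0x0000: "Basic Multilingual Plane (BMP)",
    0x0001: "Supplementary Multilingual Plane (SMP)",
    0x0002: "Supplementary Ideographic Plane (SIP)",
    0x000E: "Supplementary Special Plane (SSP)",
    0x000F: "Private Use Plane (PUP)",
    0x0010: "Private Use Plane (PUP)",
}


def plane_name(ch: str) -> str:
    """Return the name of the plane that holds the given character."""
    plane = ord(ch) >> 16
    return PLANE_NAMES.get(plane, "plane not described in this appendix")


print(plane_name("A"))           # Basic Multilingual Plane (BMP)
print(plane_name("\U0001F600"))  # Supplementary Multilingual Plane (SMP)
print(plane_name("\U00020000"))  # Supplementary Ideographic Plane (SIP)
```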

A.2 ASCII 

 

The American Standard Code for Information Interchange (ASCII) is a 7-bit code that was designed to provide codes for 128 symbols, mostly in American English. Today, ASCII, or Basic Latin, is part of Unicode. It occupies the first 128 codes in Unicode (00000000 to 0000007F). Table A.1 contains the hexadecimal codes and the graphic codes (symbols). The codes in hexadecimal define just the two least significant digits in Unicode. To find the actual eight-digit code, we prepend 000000 in hexadecimal to the code.
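A minimal Python sketch of this expansion follows. It assumes the full code is written with eight hexadecimal digits, as in the range 00000000 to 0000007F, so six zero digits go in front of the two-digit table value; `full_code` is a hypothetical helper name.

```python
def full_code(two_digit_hex: str) -> str:
    """Expand a two-digit code from Table A.1 to the full 8-digit code."""
    if len(two_digit_hex) != 2:
        raise ValueError("expected exactly two hexadecimal digits")
    return "000000" + two_digit_hex.upper()


print(full_code("41"))                # 00000041
print(chr(int(full_code("41"), 16)))  # A

# Every ASCII character falls in the range 00000000 to 0000007F.
print(all(ord(c) <= 0x7F for c in "Hello, world!"))  # True
```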

Table A.1

Hex  Symbol  Hex  Symbol  Hex  Symbol  Hex  Symbol
00   NULL    20   SP      40   @       60   `
01   SOH     21   !       41   A       61   a
02   STX     22   "       42   B       62   b
03   ETX     23   #       43   C       63   c
04   EOT     24   $       44   D       64   d
05   ENQ     25   %       45   E       65   e
06   ACK     26   &       46   F       66   f
07   BEL     27   '       47   G       67   g
08   BS      28   (       48   H       68   h
09   HT      29   )       49   I       69   i
0A   LF      2A   *       4A   J       6A   j
0B   VT      2B   +       4B   K       6B   k
0C   FF      2C   ,       4C   L       6C   l
0D   CR      2D   -       4D   M       6D   m
0E   SO      2E   .       4E   N       6E   n
0F   SI      2F   /       4F   O       6F   o
10   DLE     30   0       50   P       70   p
11   DC1     31   1       51   Q       71   q
12   DC2     32   2       52   R       72   r
13   DC3     33   3       53   S       73   s
14   DC4     34   4       54   T       74   t
15   NAK     35   5       55   U       75   u
16   SYN     36   6       56   V       76   v
17   ETB     37   7       57   W       77   w
18   CAN     38   8       58   X       78   x
19   EM      39   9       59   Y       79   y
1A   SUB     3A   :       5A   Z       7A   z
1B   ESC     3B   ;       5B   [       7B   {
1C   FS      3C   <       5C   \       7C   |
1D   GS      3D   =       5D   ]       7D   }
1E   RS      3E   >       5E   ^       7E   ~
1F   US      3F   ?       5F   _       7F   DEL

 

  

Some properties of ASCII

ASCII has some interesting properties that we briefly mention here. 

1. The space character, (20)₁₆, is a printable character. It prints a blank space.

2. The uppercase letters start from (41)₁₆. The lowercase letters start from (61)₁₆. When compared, uppercase letters are numerically smaller than lowercase letters. This means that in a sorted list based on ASCII values, the uppercase letters appear before the lowercase letters.

3. The uppercase letters and lowercase letters differ by only one bit in the 7-bit code. For example, character A is (1000001)₂ and character a is (1100001)₂. The difference is in bit 6, which is 0 in uppercase letters and 1 in lowercase letters. If we know the code for one case, we can easily find the code for the other by adding or subtracting (20)₁₆, or we can just flip the sixth bit.

4. The uppercase letters are not immediately followed by lowercase letters. There are some punctuation characters in between.

5. Digits (0 to 9) start from (30)₁₆. This means that if you want to change a numeric character to its face value as an integer, you need to subtract (30)₁₆ = 48 from it.

6. The first 32 characters, (00)₁₆ to (1F)₁₆, and the last character, (7F)₁₆, are non-printable characters. Character (00)₁₆ is simply used as a delimiter to mark the end of a character string. Character (7F)₁₆ is the delete character, used to delete the previous character. The rest of the non-printable characters are referred to as control characters and are used in data communication. Table A.2 gives the description of these characters.
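Several of these properties can be tried out directly. The Python sketch below is only an illustration: the function names are hypothetical, and "bit 6" is counted from 1 starting at the rightmost bit, so it has the value (20)₁₆.

```python
def swap_case_ascii(ch: str) -> str:
    """Property 3: flip bit 6 (value 0x20) of an ASCII letter to change its case."""
    if len(ch) != 1 or not (ch.isascii() and ch.isalpha()):
        raise ValueError("expected a single ASCII letter")
    return chr(ord(ch) ^ 0x20)


def digit_value(ch: str) -> int:
    """Property 5: subtract 0x30 (48) to get the face value of a digit character."""
    if len(ch) != 1 or not ("0" <= ch <= "9"):
        raise ValueError("expected a single ASCII digit")
    return ord(ch) - 0x30


def is_printable_ascii(ch: str) -> bool:
    """Properties 1 and 6: codes 0x20 to 0x7E print; 0x00 to 0x1F and 0x7F do not."""
    return 0x20 <= ord(ch) <= 0x7E


print(swap_case_ascii("A"), swap_case_ascii("a"))  # a A
print(digit_value("7"))                            # 7

# Property 2: sorting by code puts uppercase letters before lowercase ones.
print(sorted(["b", "B", "a", "A"]))                # ['A', 'B', 'a', 'b']

# Property 4: the characters between Z (0x5A) and a (0x61).
between = "".join(chr(c) for c in range(0x5B, 0x61))
print(between)                                     # [\]^_`

print(is_printable_ascii(" "), is_printable_ascii("\x7f"))  # True False
```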

 


  (property /ˈprɒp.ə.ti/  n.  The properties of a substance or object are the ways in which it behaves in particular conditions.

   flip /flɪp/  v.  If something flips over, or if you flip it over or into a different position, it moves or is moved into a different position.)

 

 

 

Table A.2 ASCII Code


Symbol  Interpretation        Symbol  Interpretation
SOH     Start of heading      DC1     Device control 1
STX     Start of text         DC2     Device control 2
ETX     End of text           DC3     Device control 3
EOT     End of transmission   DC4     Device control 4
ENQ     Enquiry               NAK     Negative acknowledgment
ACK     Acknowledgment        SYN     Synchronous idle
BEL     Ring bell             ETB     End of transmission block
BS      Backspace             CAN     Cancel
HT      Horizontal tab        EM      End of medium
LF      Line feed             SUB     Substitute
VT      Vertical tab          ESC     Escape
FF      Form feed             FS      File separator
CR      Carriage return       GS      Group separator
SO      Shift out             RS      Record separator
SI      Shift in              US      Unit separator
DLE     Data link escape

 

                                               

 

   

posted @ 2018-01-23 22:18  ZQXTXK