坏小仔


URLs were designed to be portable. They were also designed to uniformly name all the resources on the Internet, which means that they will be transmitted through various protocols. Because all of these protocols have different mechanisms for transmitting their data, it was important for URLs to be designed so that they could be transmitted safely through any Internet protocol.

Safe transmission means that URLs can be transmitted without the risk of losing information. Some protocols, such as the Simple Mail Transfer Protocol (SMTP) for electronic mail, use transmission methods that can strip off certain characters.[4] To get around this, URLs are permitted to contain only characters from a relatively small, universally safe alphabet.

[4] This is caused by the use of a 7-bit encoding for messages; this can strip off information if the source is encoded in 8 bits or more.
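To make the footnote concrete, here is a minimal sketch of what can happen when 8-bit data passes through a channel that keeps only the low 7 bits of each byte. The helper name strip_high_bit and the choice of Latin-1 for the sample text are assumptions made for illustration; real mailers may damage data in other ways.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Simulate a 7-bit channel by clearing the top bit of every byte."""
    return bytes(b & 0x7F for b in data)

# Hypothetical sample: "cafe" with an accented final letter, stored as Latin-1,
# where the accented letter is the single byte 0xE9.
original = "caf\u00e9".encode("latin-1")
damaged = strip_high_bit(original)

print(original)   # the 8-bit source bytes
print(damaged)    # the same bytes after losing their high bit

# Pure 7-bit ASCII text passes through unchanged.
print(strip_high_bit(b"/index.html") == b"/index.html")
```

The accented letter comes out as a different, unrelated ASCII character, which is the kind of information loss a restricted URL alphabet is meant to avoid.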

In addition to wanting URLs to be transportable by all Internet protocols, designers wanted them to be readable by people. So invisible, nonprinting characters are also prohibited in URLs, even though these characters may pass through mailers and otherwise be portable.[5]

[5] Nonprinting characters include whitespace (note that RFC 2396 recommends that applications ignore whitespace).

To complicate matters further, URLs also need to be complete. URL designers realized there would be times when people would want URLs to contain binary data or characters outside of the universally safe alphabet. So, an escape mechanism was added, allowing unsafe characters to be encoded into safe characters for transport.

This section summarizes the universal alphabet and encoding rules for URLs.

2.4.1 The URL Character Set

Default computer system character sets often have an Anglocentric bias. Historically, many computer applications have used the US-ASCII character set. US-ASCII uses 7 bits to represent most keys available on an English typewriter and a few nonprinting control characters for text formatting and hardware signalling.

US-ASCII is very portable, due to its long legacy. But while it's convenient to citizens of the U.S., it doesn't support the accented characters common in European languages or the hundreds of non-Roman scripts read by billions of people around the world.

Furthermore, some URLs may need to contain arbitrary binary data. Recognizing the need for completeness, the URL designers have incorporated escape sequences. Escape sequences allow the encoding of arbitrary character values or data using a restricted subset of the US-ASCII character set, yielding portability and completeness.

2.4.2 Encoding Mechanisms

To get around the limitations of a safe character set representation, an encoding scheme was devised to represent characters in a URL that are not safe. The encoding simply represents the unsafe character by an "escape" notation, consisting of a percent sign (%) followed by two hexadecimal digits that represent the ASCII code of the character.
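The escape notation can be sketched in a few lines of Python. This is only an illustration: the SAFE set below is a deliberately small assumption rather than the full set of characters the URL specifications allow, the function name percent_encode is hypothetical, and encoding non-ASCII text as UTF-8 bytes before escaping is a further assumption that the text above does not state.

```python
# Characters left unescaped in this sketch (an assumed, simplified safe set).
SAFE = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "-_./"
)

def percent_encode(text: str) -> str:
    """Replace every byte outside SAFE with '%' plus two hexadecimal digits."""
    out = []
    for byte in text.encode("utf-8"):  # assumption: UTF-8 for non-ASCII input
        ch = chr(byte)
        if ch in SAFE:
            out.append(ch)
        else:
            out.append(f"%{byte:02X}")  # e.g. byte 32 (space) becomes %20
    return "".join(out)

print(percent_encode("more tools.html"))        # more%20tools.html
print(percent_encode("100%satisfaction.html"))  # 100%25satisfaction.html
print(percent_encode("~joe"))                   # %7Ejoe
```

Note that the percent sign itself must be escaped (as %25), since otherwise a literal "%" in a name could not be told apart from the start of an escape sequence.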

Table 2-2. Some encoded character examples

Character   ASCII code   Example URL

~           126 (0x7E)   http://www.joes-hardware.com/%7Ejoe

SPACE       32 (0x20)    http://www.joes-hardware.com/more%20tools.html

%           37 (0x25)    http://www.joes-hardware.com/100%25satisfaction.html
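As a quick check, the example URLs in Table 2-2 can be decoded with Python's standard urllib.parse module. This is a sketch using the standard library rather than anything prescribed by the text, and the variable names are illustrative.

```python
from urllib.parse import quote, unquote

# The three example URLs from Table 2-2.
examples = [
    "http://www.joes-hardware.com/%7Ejoe",
    "http://www.joes-hardware.com/more%20tools.html",
    "http://www.joes-hardware.com/100%25satisfaction.html",
]

# unquote() turns each %XX escape back into the character it stands for.
decoded = [unquote(url) for url in examples]
for url in decoded:
    print(url)

# quote() goes the other way for a single path segment.
print(quote("more tools.html"))        # more%20tools.html
print(quote("100%satisfaction.html"))  # 100%25satisfaction.html
```

The tilde is left out of the quote() calls on purpose: whether a given library escapes "~" depends on which revision of the URL rules it follows, so the output for that character may differ between versions.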

posted on 2012-08-25 15:52  坏小仔