Reference: the Unicode introduction in the Pylons documentation: http://wiki.pylonshq.com/display/pylonsdocs/Unicode
1.3 Unicode Literals in Python Source Code
In Python source code, Unicode literals are written as strings prefixed with the 'u' or 'U' character:
>>> u'abcdefghijk'
You can also use the ", """ or ''' quoting styles. For example:
>>> u"""This
Specific code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point. The \U escape takes eight hex digits instead of four. Unicode literals can also use the same escape sequences as 8-bit strings, including \x, but \x takes only two hex digits, so it cannot express every available code point. You can create a one-character Unicode string from a code point with the unichr() built-in function and find the ordinal of a character with ord().
Here is an example demonstrating the different alternatives:
>>> s = u"\x66\u0072\u0061\U0000006e" + unichr(231) + u"ais"
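The same construction can be checked as a small script. This is a minimal sketch rather than part of the original listing; the try/except shim is an assumption added so that it also runs on Python 3, where unichr() was removed in favour of chr().

```python
# Python 3 removed unichr(); fall back to chr() so the sketch runs on both.
try:
    unichr
except NameError:
    unichr = chr

# \x takes two hex digits, \u takes four, and \U takes eight.
s = u"\x66\u0072\u0061\U0000006e" + unichr(231) + u"ais"

print(len(s))                # 8 code points
print(ord(s[4]))             # 231, the character built with unichr(231)
print(s == u"fran\xe7ais")   # True: the same string with a single \x escape
```

All three escape forms produce ordinary characters, so the result compares equal to a literal written any other way.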
Using escape sequences for code points greater than 127 is fine in small doses, but Python 2.4 and above support writing Unicode literals in any encoding, as long as you declare the encoding by including a special comment as either the first or second line of the source file:
#!/usr/bin/env python
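A minimal sketch of a complete file with a declaration might look like the following. It assumes the file is really saved as UTF-8; the variable names and the sample string are illustrative, not taken from the original listing.

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# The declaration above must sit on the first or second line of the file.
# Assumption: the editor actually saves this file as UTF-8.

u = u'abcdé'          # a non-ASCII character written directly in the source
last = ord(u[-1])     # code point of the final character
print(last)
```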
If you don't include such a comment, the default encoding used will be ASCII. Versions of Python before 2.4 were Euro-centric and assumed Latin-1 as a default encoding for string literals; in Python 2.4, characters greater than 127 still work but result in a warning. For example, the following program has no encoding declaration:
#!/usr/bin/env python
When you run it with Python 2.4, it will output the following warning:
sys:1: DeprecationWarning: Non-ASCII character '\xe9' in file testas.py on line
and then the following output:
233
For real-world use, UTF-8 is the recommended encoding for your source file. Make sure your text editor actually saves the file as UTF-8; otherwise the Python interpreter will try to parse the bytes as UTF-8 when they are actually stored in some other encoding.
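The effect of such a mismatch can be reproduced without an editor. The sketch below is an illustration under assumed sample data: it encodes a string as Latin-1, as an editor with the wrong setting might, and then tries to read the bytes back as UTF-8.

```python
text = u"caf\xe9"

latin1_bytes = text.encode("latin-1")   # what a mis-configured editor would save
utf8_bytes = text.encode("utf-8")       # what the declaration promises

try:
    latin1_bytes.decode("utf-8")
    mismatch_detected = False
except UnicodeDecodeError:
    # The lone 0xE9 byte is not a valid UTF-8 sequence.
    mismatch_detected = True

print(mismatch_detected)
print(utf8_bytes.decode("utf-8") == text)
```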
Note
Windows users who use the SciTE editor can specify the encoding of their file using the File->Encoding menu.
Note
If you are working with Unicode in detail you might also be interested in the unicodedata module which can be used to find out Unicode properties such as a character's name, category, numeric value and the like.
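As a brief sketch of what the unicodedata module offers, the following looks up the name, category and numeric value of two sample characters. The characters chosen are only examples.

```python
import unicodedata

e_acute = u"\xe9"    # LATIN SMALL LETTER E WITH ACUTE
one_half = u"\xbd"   # VULGAR FRACTION ONE HALF

name = unicodedata.name(e_acute)
category = unicodedata.category(e_acute)   # two-letter general category
value = unicodedata.numeric(one_half)      # numeric value as a float

print(name)
print(category)
print(value)
```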
2 Applying this to Web Programming
So far we've seen how to use encoding in source files and seen how to decode text to Unicode and encode it back to text. We've also seen that Unicode objects can be manipulated in similar ways to strings and we've seen how to perform input and output operations on files. Next we are going to look at how best to use Unicode in a web app.
The main rule is this:
Your application should use Unicode for all strings internally, decoding any input to Unicode as soon as it enters the application and encoding the Unicode to UTF-8 or another encoding only on output.
If you fail to do this, UnicodeDecodeErrors will start popping up in unexpected places whenever Unicode strings are combined with normal 8-bit strings. Python's default encoding is ASCII, so it will try to decode the 8-bit string as ASCII and fail on any non-ASCII byte. It is always better to do any encoding or decoding at the edges of your application; otherwise you will end up patching lots of different parts of your application unnecessarily as and when errors pop up.
Unless you have a very good reason not to, it is wise to use UTF-8 as the default encoding since it is so widely supported.
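The shape of this rule can be sketched in a few lines. The function names below (decode_input, greet, encode_output) are hypothetical and not part of Pylons; they only mark where the decoding and encoding steps belong.

```python
def decode_input(raw, encoding="utf-8"):
    """Edge of the application: turn incoming bytes into Unicode."""
    return raw.decode(encoding)

def greet(name):
    """Internal logic: works with Unicode only, never with raw bytes."""
    return u"Hello, " + name + u"!"

def encode_output(text, encoding="utf-8"):
    """Edge of the application: turn Unicode back into bytes for output."""
    return text.encode(encoding)

# Simulated request data: the bytes a browser might send for a name field.
raw_name = u"Fran\xe7ois".encode("utf-8")

body = encode_output(greet(decode_input(raw_name)))
print(repr(body))
```

Because greet() never sees bytes, there is no point inside the application where an implicit ASCII decode can be triggered.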
The second rule is:
Always test your application with characters above 127 and above 255 wherever possible.
If you fail to do this, you might think your application is working fine, but as soon as your users enter non-ASCII characters you will have problems. Arabic is always a good test, and www.google.ae is a good source of sample text.
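One way to see why both thresholds matter is to round-trip sample strings through several encodings. This is a minimal sketch with assumed sample data; survives_round_trip is a hypothetical helper, not a library function.

```python
SAMPLES = [
    u"plain ascii",                      # every code point below 128
    u"caf\xe9",                          # U+00E9 is above 127
    u"\u0645\u0631\u062d\u0628\u0627",   # Arabic, every code point above 255
]
ENCODINGS = ["ascii", "latin-1", "utf-8"]

def survives_round_trip(text, encoding):
    """Return True if text can be encoded and decoded back unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        return False

# One row per sample, one column per encoding.
results = [[survives_round_trip(s, enc) for enc in ENCODINGS] for s in SAMPLES]
for row in results:
    print(row)
```

A test that only uses characters up to 255 would pass against Latin-1 and hide the failure that the Arabic sample exposes.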
The third rule is:
Always do any checking of a string for illegal characters once it's in the form that will be used or stored, otherwise the illegal characters might be disguised.
For example, let's say you have a content management system that takes a Unicode filename, and you want to disallow paths with a '/' character. You might write this code:
def read_file(filename, encoding):
This is INCORRECT. If an attacker could specify the 'base64' encoding, they could pass L2V0Yy9wYXNzd2Q=, the base-64 encoded form of the string '/etc/passwd', a file you clearly don't want an attacker to get hold of. The code above looks for / characters in the encoded form and misses the dangerous character in the decoded result.
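The difference between checking before and after decoding can be sketched as follows. Both function names are hypothetical, and codecs.decode() is used here as an assumption so that the 'base64' codec behaves the same on Python 2 and Python 3; no file is actually opened.

```python
import codecs

def decode_name_unsafe(raw_name, encoding):
    # INCORRECT: the check runs on the encoded bytes, before decoding.
    if b"/" in raw_name:
        raise ValueError("'/' not allowed in filenames")
    return codecs.decode(raw_name, encoding)

def decode_name_safe(raw_name, encoding):
    # Decode first, then check the form that will actually be used.
    name = codecs.decode(raw_name, encoding)
    slash = b"/" if isinstance(name, bytes) else u"/"
    if slash in name:
        raise ValueError("'/' not allowed in filenames")
    return name

attack = b"L2V0Yy9wYXNzd2Q="   # base-64 form of '/etc/passwd'

leaked = decode_name_unsafe(attack, "base64")   # the path slips through
print(repr(leaked))

try:
    decode_name_safe(attack, "base64")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```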
Those are the three basic rules so now we will look at some of the places you might want to perform Unicode decoding in a Pylons application.