Reading Text Files with Different Encodings in Python 3

1. To read a UTF-8 encoded file without a BOM, pass encoding='utf-8' to open().

2. To read a UTF-8 encoded file with a BOM, pass encoding='utf-8-sig' to open().

3. To read a GBK encoded file without a BOM, pass encoding='gbk' to open(). (All three cases are sketched right after this list.)
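
A minimal sketch of the three cases (the file names are hypothetical, just for illustration):

with open('utf8.txt', encoding='utf-8') as f:          # UTF-8 without BOM
    print(f.read())
with open('utf8_bom.txt', encoding='utf-8-sig') as f:  # UTF-8 with BOM; the BOM is stripped
    print(f.read())
with open('gbk.txt', encoding='gbk') as f:             # GBK (a BOM does not apply)
    print(f.read())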

A catch-all method (detect the encoding with the chardet package):

import os
import chardet

# Sample the first 32 bytes (or the whole file, if smaller) and guess the encoding.
num_bytes = min(32, os.path.getsize(filename))  # renamed so it doesn't shadow the built-in 'bytes'
with open(filename, 'rb') as f:
    raw = f.read(num_bytes)
encoding = chardet.detect(raw)['encoding']

# Reopen in text mode with the detected encoding.
with open(filename, 'r', encoding=encoding) as infile:
    data = infile.read()

print(data)


References:

Reading Unicode file data with BOM chars in Python
http://stackoverflow.com/questions/13590749/reading-unicode-file-data-with-bom-chars-in-python#comment18629764_13591421


The Python API documentation covers this in detail (the passage below, from the codecs module's "Encodings and Unicode" section, opens by referring to the single-byte charmap encodings it has just discussed):

All of these encodings can only encode 256 of the 1114112 codepoints defined in Unicode. A simple and straightforward way that can store each Unicode code point, is to store each codepoint as four consecutive bytes. There are two possibilities: store the bytes in big endian or in little endian order. These two encodings are called UTF-32-BE and UTF-32-LE respectively. Their disadvantage is that if e.g. you use UTF-32-BE on a little endian machine you will always have to swap bytes on encoding and decoding. UTF-32 avoids this problem: bytes will always be in natural endianness. When these bytes are read by a CPU with a different endianness, then bytes have to be swapped though.

To be able to detect the endianness of a UTF-16 or UTF-32 byte sequence, there's the so called BOM ("Byte Order Mark"). This is the Unicode character U+FEFF. This character can be prepended to every UTF-16 or UTF-32 byte sequence. The byte swapped version of this character (0xFFFE) is an illegal character that may not appear in a Unicode text. So when the first character in a UTF-16 or UTF-32 byte sequence appears to be a U+FFFE the bytes have to be swapped on decoding.

Unfortunately the character U+FEFF had a second purpose as a ZERO WIDTH NO-BREAK SPACE: a character that has no width and doesn't allow a word to be split. It can e.g. be used to give hints to a ligature algorithm. With Unicode 4.0 using U+FEFF as a ZERO WIDTH NO-BREAK SPACE has been deprecated (with U+2060 (WORD JOINER) assuming this role). Nevertheless Unicode software still must be able to handle U+FEFF in both roles: as a BOM it's a device to determine the storage layout of the encoded bytes, and vanishes once the byte sequence has been decoded into a string; as a ZERO WIDTH NO-BREAK SPACE it's a normal character that will be decoded like any other.
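
The BOM behavior is easy to check interactively; a quick sketch (byte output shown for a little-endian machine):

print('hi'.encode('utf-16-be'))  # b'\x00h\x00i'          — big endian, no BOM
print('hi'.encode('utf-16-le'))  # b'h\x00i\x00'          — little endian, no BOM
print('hi'.encode('utf-16'))     # b'\xff\xfeh\x00i\x00'  — BOM (0xFF 0xFE) + native byte order
print(b'\xff\xfeh\x00i\x00'.decode('utf-16'))  # 'hi'     — the BOM is consumed on decoding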

There’s another encoding that is able to encode the full range of Unicode characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two parts: marker bits (the most significant bits) and payload bits. The marker bits are a sequence of zero to four 1 bits followed by a 0 bit. Unicode characters are encoded like this (with x being payload bits, which when concatenated give the Unicode character):

Range                      Encoding
U-00000000 ... U-0000007F  0xxxxxxx
U-00000080 ... U-000007FF  110xxxxx 10xxxxxx
U-00000800 ... U-0000FFFF  1110xxxx 10xxxxxx 10xxxxxx
U-00010000 ... U-0010FFFF  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The least significant bit of the Unicode character is the rightmost x bit.
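
A worked instance of the table: U+20AC (the euro sign) falls in the three-byte row, so its 16 payload bits 0010 0000 1010 1100 are split as 1110-0010 10-000010 10-101100:

print('\u20ac'.encode('utf-8'))   # b'\xe2\x82\xac'
print(hex(0b11100010), hex(0b10000010), hex(0b10101100))  # 0xe2 0x82 0xac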

As UTF-8 is an 8-bit encoding no BOM is required and any U+FEFF character in the decoded string (even if it’s the first character) is treated as a ZERO WIDTH NO-BREAK SPACE.
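
A leading U+FEFF therefore comes through a plain utf-8 decode unchanged:

print(repr(b'\xef\xbb\xbfabc'.decode('utf-8')))  # '\ufeffabc' — the BOM stays in the string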

Without external information it’s impossible to reliably determine which encoding was used for encoding a string. Each charmap encoding can decode any random byte sequence. However that’s not possible with UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow arbitrary byte sequences. To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. As it’s rather improbable that any charmap encoded file starts with these byte values (which would e.g. map to

LATIN SMALL LETTER I WITH DIAERESIS
RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a utf-8-sig encoding can be correctly guessed from the byte sequence. So here the BOM is not used to be able to determine the byte order used for generating the byte sequence, but as a signature that helps in guessing the encoding. On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
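
Both directions of the utf-8-sig signature handling, as a quick check:

raw = 'hello'.encode('utf-8-sig')
print(raw)                           # b'\xef\xbb\xbfhello' — the three signature bytes come first
print(raw.decode('utf-8-sig'))       # 'hello' — the signature is skipped on decoding
print(b'hello'.decode('utf-8-sig'))  # 'hello' — also fine when no signature is present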



