pandas raises UnicodeDecodeError when reading a file: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data
Reading a txt file with pandas,
data = pd.read_table(os.path.join(project_path, 'src/data/corpus.txt'), sep='\n')
produces the following error:
'utf-8' codec can't decode bytes in position 0-1: unexpected end of data
The cause of this error is:
you cannot randomly partition the bytes you've received and then ask UTF-8 to decode it. UTF-8 is a multibyte encoding, meaning you can have anywhere from 1 to 6 bytes to represent one character. If you chop that in half, and ask Python to decode it, it will throw you the unexpected end of data error.
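In other words, UTF-8 is a multibyte encoding in which one character occupies several bytes (1 to 4 under the current standard, RFC 3629; the original design allowed up to 6), so you cannot cut the byte stream at an arbitrary point and then ask Python to decode the pieces.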
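The truncation described above can be reproduced with the standard library alone. This is a minimal sketch, not the pandas code path itself: the sample character and the variable names are chosen only for illustration.

```python
# "汉" (U+6C49) is encoded as three bytes in UTF-8.
raw = "汉".encode("utf-8")
print(len(raw))  # 3

# Keep only the first two bytes, as if the data had been cut mid-character.
truncated = raw[:2]

try:
    truncated.decode("utf-8")
    message = None
    reason = None
except UnicodeDecodeError as exc:
    message = str(exc)    # full error text, same shape as the pandas error
    reason = exc.reason   # short reason reported by the codec

print(message)
```

The decoder sees a valid lead byte and one continuation byte, then runs out of input, which is why it reports "unexpected end of data" rather than an invalid byte.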
Solutions:
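- If the file contains Chinese characters and the problem is an encoding mismatch, pass the encoding explicitly. Note that the parameter is named encoding, not encodings, and that if the file was saved in another encoding (for example GBK), that name should be passed instead:
encoding='utf-8'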
- The path contains Chinese characters. In this case, make sure the path consists only of ASCII letters.
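- If neither applies, then according to this GitHub discussion: https://github.com/pandas-dev/pandas/issues/43540 , you can add the encoding_errors parameter (available since pandas 1.3). Be aware that 'ignore' silently drops any bytes that cannot be decoded.
data = pd.read_table(os.path.join(project_path, 'src/data/corpus.txt'), sep='\n', encoding_errors='ignore')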
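Before falling back to ignoring errors, it can help to check which encoding the file actually decodes with. The sketch below uses only the standard library. The helper `first_working_encoding`, its candidate list, and the temporary sample file are all hypothetical and are not part of pandas. A successful decode does not prove the encoding is the right one, so treat the result as a hint.

```python
import os
import tempfile


def first_working_encoding(path, candidates=("utf-8", "gbk", "utf-16")):
    """Return the first candidate that decodes the whole file, else None.

    Hypothetical helper; the candidate list is an assumption.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Build a small GBK-encoded sample file to try the helper on.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "corpus.txt")
with open(sample_path, "w", encoding="gbk") as f:
    f.write("汉字编码\n第二行\n")

detected = first_working_encoding(sample_path)
print(detected)

# What errors="ignore" does: the incomplete trailing character is dropped.
truncated = "汉字".encode("utf-8")[:4]
survived = truncated.decode("utf-8", errors="ignore")
print(survived)
```

The detected name can then be passed to pandas as `encoding=detected`. The last two lines show why ignoring errors should be a last resort: the second character disappears without any warning.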