pandas raises UnicodeDecodeError when reading a file: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data

I was reading a txt file with pandas:

data = pd.read_table(os.path.join(project_path, 'src/data/corpus.txt'), sep='\n')

and got the following error:
'utf-8' codec can't decode bytes in position 0-1: unexpected end of data

The cause of this error is:

you cannot randomly partition the bytes you've received and then ask UTF-8 to decode it. UTF-8 is a multibyte encoding, meaning you can have anywhere from 1 to 6 bytes to represent one character. If you chop that in half, and ask Python to decode it, it will throw you the unexpected end of data error.

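In other words, UTF-8 is a multibyte encoding in which a single character can take several bytes (the quote says 1 to 6; since RFC 3629 the limit is 4), so you cannot cut the byte stream at an arbitrary point and then ask Python to decode the pieces.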
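A minimal sketch of that failure, using only the standard library rather than pandas. The character and the cut point are chosen purely for illustration: "中" takes three bytes in UTF-8, and decoding only the first two reproduces the same message as above.

```python
# "中" is encoded as three bytes in UTF-8.
encoded = "中".encode("utf-8")

# Keep only the first two bytes, i.e. cut the character in the middle.
truncated = encoded[:2]

try:
    truncated.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # The decoder reports that the sequence ended too early.
    error = exc

print(error)
```

The same thing happens whenever a reader hands the decoder a buffer that ends in the middle of a multibyte character.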

Solutions:

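  1. If the encoding problem comes from Chinese characters in the file, pass the encoding explicitly with encoding='utf-8'. Since utf-8 is already the pandas default, if the file was saved in a different encoding (GBK, for example), pass that encoding instead.
  2. The path contains Chinese characters. In that case, make sure the path consists only of English letters (ASCII characters).
  3. If neither applies, then following this GitHub discussion: https://github.com/pandas-dev/pandas/issues/43540 , you can add the encoding_errors parameter (available since pandas 1.3):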
data = pd.read_table(os.path.join(project_path, 'src/data/corpus.txt'), sep='\n', encoding_errors='ignore')
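To see what 'ignore' actually does to the data, here is a small standard-library sketch. It assumes that encoding_errors takes the same handler names as the errors argument of Python's own decoders, which is how the pandas documentation describes it. The sample bytes are made up for illustration.

```python
# "你好" with the last byte dropped: "你" is intact, "好" is truncated.
raw = "你好".encode("utf-8")[:-1]

# 'strict' is the default handler and raises on the broken tail.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' silently drops the undecodable bytes.
ignored = raw.decode("utf-8", errors="ignore")

# 'replace' keeps a visible marker (U+FFFD) where bytes could not be decoded.
replaced = raw.decode("utf-8", errors="replace")

print(strict_failed, ignored, replaced)
```

Note that 'ignore' loses data without any warning, so 'replace' may be the safer choice when you want to notice afterwards where the file was damaged.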
posted @ 2023-03-13 14:08  地球美好不