pandas raises UnicodeDecodeError when reading a file: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data

I was reading a txt file with pandas:

data = pd.read_table(os.path.join(project_path, 'src/data/corpus.txt'), sep='\n')

and it failed with the following error:
'utf-8' codec can't decode bytes in position 0-1: unexpected end of data

The cause of this error is:

you cannot randomly partition the bytes you've received and then ask UTF-8 to decode it. UTF-8 is a multibyte encoding, meaning you can have anywhere from 1 to 6 bytes to represent one character. If you chop that in half, and ask Python to decode it, it will throw you the unexpected end of data error.

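In other words, UTF-8 is a multibyte encoding in which a single character takes more than one byte in many cases (1 to 4 bytes under the current standard, RFC 3629; the original design allowed up to 6). The byte stream therefore cannot be cut at an arbitrary point and then handed to Python for decoding.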
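The same message can be reproduced without pandas. The snippet below is a minimal sketch, not the original failing case: it encodes one Chinese character (an arbitrary sample chosen for illustration), cuts off the last byte, and tries to decode what is left.

```python
# One Chinese character takes three bytes in UTF-8.
encoded = "中".encode("utf-8")
print(encoded)  # b'\xe4\xb8\xad'

# Drop the last byte to simulate data that was cut in the middle of a character.
truncated = encoded[:2]

error = None
try:
    truncated.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc  # keep a reference; `exc` is cleared when the except block ends
    print(exc)   # 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data
```

Positions 0-1 in the message are the two leftover bytes of the incomplete character, which matches the error reported above.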

Solutions:

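If the error is caused by Chinese characters in the text, it is an encoding problem. In that case, pass the file's encoding explicitly, e.g. encoding='utf-8'.
If the path contains Chinese characters, make sure the path consists only of English letters.
If neither applies, then according to this GitHub discussion: https://github.com/pandas-dev/pandas/issues/43540 , you can add the encoding_errors parameter.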

data = pd.read_table(os.path.join(project_path, 'src/data/corpus.txt'), sep='\n', encoding_errors='ignore')
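Because 'ignore' silently drops every byte it cannot decode, it may be worth checking first whether the file is simply stored in a different encoding. The sketch below uses only the standard library. The helper name `pick_encoding`, the candidate list, and the temporary sample file are assumptions made for illustration and are not part of pandas or of the original post.

```python
import os
import tempfile


def pick_encoding(path, candidates=("utf-8", "gb18030")):
    """Return the first candidate that decodes the whole file, else None."""
    for name in candidates:
        try:
            with open(path, encoding=name) as handle:
                handle.read()  # force a full decode of the file
            return name
        except UnicodeDecodeError:
            continue
    return None


# Hypothetical sample: a small corpus saved as GB18030 instead of UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "corpus.txt")
    with open(sample, "wb") as handle:
        handle.write("中文语料\n".encode("gb18030"))
    detected = pick_encoding(sample)
print(detected)  # gb18030

# What the error handlers do at the byte level with a cut-off character:
damaged = "ab中".encode("utf-8")[:-1]
ignored = damaged.decode("utf-8", errors="ignore")    # undecodable bytes are dropped
replaced = damaged.decode("utf-8", errors="replace")  # marked with U+FFFD instead
print(ignored)  # ab
print(replaced)
```

If the helper returns a name, it can be passed to `pd.read_table` as `encoding=...`. The `encoding_errors` parameter (available since pandas 1.3) is then only needed when no candidate fits, and 'replace' at least leaves a visible marker where data was lost.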

posted @ 2023-03-13 14:08 地球美好不
