Problem: UnicodeDecodeError: 'gbk' codec … when using BeautifulSoup with Python 3

I wanted to convert an HTML file to plain text by calling BeautifulSoup from Python 3.

This very simple code, which opens a local file, kept failing:

from bs4 import BeautifulSoup
file = open('index.html')
soup = BeautifulSoup(file,'lxml')
print (soup)

It raises the following error:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xff in position 0: illegal multibyte sequence

Doesn't BeautifulSoup claim to handle all kinds of encodings? Why does parsing still fail?

Searching for BeautifulSoup-related answers turned up nothing that worked. Then I noticed what happens if the code is written like this:

from bs4 import BeautifulSoup
file = open('index.html')
str1 = file.read()  # the error is raised on this line
soup = BeautifulSoup(str1,'lxml')
print (soup)

So that's it: the problem is in reading the file, not in BeautifulSoup's parsing.
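The failure can be reproduced without BeautifulSoup at all. The sketch below is only an illustration, not the original setup: the file contents and temporary path are made up, and gbk is passed explicitly to mimic what open() would choose by default on a system whose locale encoding is gbk.

```python
import codecs
import os
import tempfile

# Hypothetical sample: a tiny HTML file stored as UTF-16 little-endian
# with a byte order mark (FF FE) at the start.
html = "<html><body><p>hello</p></body></html>"
path = os.path.join(tempfile.mkdtemp(), "index.html")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + html.encode("utf-16-le"))

# Look at the first two raw bytes on disk.
with open(path, "rb") as f:
    first_bytes = f.read(2)

# Decode the file as gbk, as a gbk-locale system would do by default.
try:
    with open(path, encoding="gbk") as f:
        f.read()
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(first_bytes)
print(error)
```

No BeautifulSoup call is involved, yet the same kind of exception appears, which supports the conclusion that the decoding step in open()/read() is what fails.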

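So why does reading the file fail? open() was given no encoding, so Python 3 decoded the file with the system default, gbk here, while the file itself is UTF-16 (a 0xff byte at position 0 is consistent with a UTF-16 little-endian byte order mark). Straight to the fix, again four lines of code: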

from bs4 import BeautifulSoup
file = open('index.html','r',encoding='utf-16-le')
soup = BeautifulSoup(file,'lxml')
print (soup)
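When the encoding is not known in advance, one option is to look at the byte order mark before opening the file in text mode. The helper below is a minimal sketch under that assumption; sniff_bom_encoding and the sample file are hypothetical names introduced here, and a file without a BOM simply falls back to the given default. It also shows a side effect worth knowing: 'utf-16-le' keeps the BOM as a leading U+FEFF character, while 'utf-16' consumes it.

```python
import codecs
import os
import tempfile

def sniff_bom_encoding(path, default="utf-8"):
    """Return a codec name based on the file's byte order mark, or default."""
    with open(path, "rb") as f:
        head = f.read(4)
    # UTF-32 LE starts with FF FE 00 00, so test it before UTF-16 LE (FF FE).
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return default

# Hypothetical sample file: UTF-16 LE with a BOM.
path = os.path.join(tempfile.mkdtemp(), "index.html")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + "<p>你好</p>".encode("utf-16-le"))

encoding = sniff_bom_encoding(path)
with open(path, encoding=encoding) as f:
    text = f.read()      # BOM consumed by the 'utf-16' codec

with open(path, encoding="utf-16-le") as f:
    text_le = f.read()   # BOM kept as a leading U+FEFF character

print(encoding, repr(text), repr(text_le))
```

This only covers files that actually carry a BOM; anything else still needs the encoding to be known or guessed by other means.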

Then soup.get_text() returns the text inside the tags.
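As a small illustration of get_text(), the sketch below parses a made-up HTML string. It uses Python's built-in "html.parser" instead of lxml purely so that the example has one dependency fewer; that choice is an assumption of this sketch, not part of the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# All text nodes concatenated in document order.
plain = soup.get_text()

# One text node per line, each stripped of surrounding whitespace.
lines = soup.get_text(separator="\n", strip=True)

print(plain)
print(lines)
```

The separator and strip arguments are handy when the plain text should keep some of the document's visual structure.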

If the file mixes several encodings and still raises errors, the undecodable bytes can be ignored as follows (untested):

soup = BeautifulSoup(content.decode('utf-8', 'ignore'), 'lxml')  # content: bytes read from a file opened in 'rb' mode
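To see what the 'ignore' error handler actually does, here is a minimal sketch on a made-up byte string; raw, ignored, and replaced are names introduced for illustration. It compares 'ignore' with 'replace', which marks the bad bytes instead of dropping them.

```python
# Hypothetical input: valid UTF-8 text with one stray 0xff byte in the middle.
raw = "café".encode("utf-8") + b" \xff ok"

# 'ignore' silently drops bytes that cannot be decoded.
ignored = raw.decode("utf-8", "ignore")

# 'replace' substitutes U+FFFD so the damage stays visible.
replaced = raw.decode("utf-8", "replace")

print(repr(ignored))
print(repr(replaced))
```

Because 'ignore' discards data without any warning, it seems best suited to files that are mostly in one encoding with a few stray bytes, rather than files that are entirely in a different encoding.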

posted @ 2017-02-21 21:18 extendswind
