Character encoding in BeautifulSoup

When parsing HTML files with BeautifulSoup, statements like the following come up often:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # html holds the page markup
for string in soup.strings:
    string = string.strip()

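Note that the code above sets no encoding for string at all. Each item from soup.strings is a NavigableString, which is a Unicode string rather than ASCII-encoded bytes; under Python 2, however, any implicit conversion of it to a byte string uses the default codec, ascii.

If you overlook this, the default ascii conversion can cause a lot of trouble as soon as the page contains non-ASCII characters.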

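Even writing: string = string.decode('ascii').encode('utf-8')

does not convert the encoding of string. Under Python 2, decode() on a Unicode string first encodes it implicitly as ascii, which fails on non-ASCII characters; under Python 3, str has no decode() method at all.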
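To make the point concrete, here is a minimal Python 3 sketch using only the standard library. The sample text is made up for illustration and stands in for a string taken from a page; it is not from the original code.

```python
# Hypothetical sample text standing in for a string pulled out of a page.
text = "编码 test"

# A Unicode string carries no byte encoding yet; encode() produces the bytes.
utf8_bytes = text.encode("utf-8")

# Encoding to ASCII fails as soon as a non-ASCII character appears.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# In Python 3, str has no decode() method, so the decode/encode chain
# cannot even start.
has_decode = hasattr(text, "decode")
```

So when bytes are what you need, a single encode('utf-8') on the Unicode string is enough.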

So the approach that worked best for me is the following:

soup = BeautifulSoup(html, 'html.parser')
for string in soup.strings:
    string = str(string)  # turn the NavigableString into a plain string
    string = string.strip()

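After the str() conversion, string becomes UTF-8 encoded if your system's default encoding is UTF-8. This describes Python 2; under Python 3, str() simply returns a plain Unicode str and no byte encoding is involved.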
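As a rough illustration of the conversion under Python 3, the sketch below runs the loop on a small made-up document. The sample markup and the names cleaned and all_navigable are assumptions for the example, and 'html.parser' is just one possible parser choice.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample markup; any HTML text would do.
html = "<p>你好</p><p> world </p>"
soup = BeautifulSoup(html, "html.parser")

cleaned = []          # plain str values after conversion and stripping
all_navigable = True  # records whether every item was a NavigableString
for string in soup.strings:
    all_navigable = all_navigable and isinstance(string, NavigableString)
    plain = str(string).strip()  # plain Unicode str, detached from the tree
    cleaned.append(plain)

# Encode explicitly only at the point where bytes are really required.
utf8_chunks = [s.encode("utf-8") for s in cleaned]
```

Keeping the text as Unicode inside the program and encoding only on output avoids depending on the system default encoding.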

A classmate later pointed out that the character encoding can be specified when the soup is created:

soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')
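A short sketch of how from_encoding might be used, assuming the input arrives as raw bytes. from_encoding tells BeautifulSoup how to decode bytes; when the markup is already a Unicode string, it is ignored. The gb2312 sample bytes and the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical input: markup as raw bytes in a non-UTF-8 encoding.
raw = "<p>编码</p>".encode("gb2312")

# Tell BeautifulSoup which encoding to use when decoding the bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gb2312")

text = soup.p.string               # already decoded to Unicode
detected = soup.original_encoding  # the encoding the soup settled on
```

After parsing, the strings in the tree are Unicode whatever the input encoding was, so no further decode step is needed.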

posted @ 2013-05-06 21:11 java程序员填空
