Garbled text when scraping web pages with Python 3.4

Python learning resources and reference notes: http://bbs.fishc.com/forum.php?mod=forumdisplay&fid=243&filter=typeid&typeid=403

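1. If the page contains illegal characters: the page here is encoded as gb2312, so decode it as gbk (a superset of gb2312) and pass 'ignore' to skip any bytes that cannot be decoded.

 First decode the bytes with the page's own encoding (gbk). In Python 3 the result is already a str and can be printed directly; encode it to the local utf-8 only when writing bytes out, for example to a file.
 print(html.read().decode('gbk', 'ignore'))  # use gbk throughout here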
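The decoding step above can be sketched as a small Python 3 script. This is only an illustration: the helper names decode_page and fetch_text are made up for this example, and the sample bytes stand in for a real page so the script runs without network access.

```python
from urllib.request import urlopen


def decode_page(raw, encoding="gbk"):
    """Decode raw page bytes, silently dropping bytes that are not valid."""
    return raw.decode(encoding, "ignore")


def fetch_text(url, encoding="gbk"):
    """Download a page and decode it (hypothetical helper, not called below)."""
    with urlopen(url) as response:
        return decode_page(response.read(), encoding)


# Simulated page content: GBK-encoded text followed by one invalid byte.
sample = "中文网页".encode("gbk") + b"\xff"
text = decode_page(sample)
print(text)

# Encode to UTF-8 only when bytes are needed, e.g. when saving to a file.
utf8_bytes = text.encode("utf-8")
```

Without 'ignore', the trailing invalid byte would raise UnicodeDecodeError; with it, the byte is simply dropped.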

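2. Install a package that detects the character set automatically

 Download chardet (found here through a Baidu cached page), unpack it, and move the chardet folder into site-packages. Running pip install chardet achieves the same result.

Download address for the character set detection package: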

http://cache.baiducontent.com/c?m=9f65cb4a8c8507ed4fece76310549c24424380147e9c964f22888448e4391b145a24a8f97c3f415e80852a3047bb0c01aaa63928714562a09ab89f4baeac925938885623716cc40a50880eaebb5125b637912aabe45fbde7ac2592dec5d3a84352ba0e452f97f0fa184b569178f06560b9f5d91e4219&p=8e769a478d9b19e517bd9b7d081d81&newp=927dd51885cc43ec08e2977b065e90231601d13523808c0a3b8fd12590605e55113d8eff7062515f8e99736301a4495deaf031713d032bb79bc98e4adbb8866e42c970767f4bda1751&user=baidu&fm=sc&query=https//pypi%2Epython%2Eorg/pypi/chardet&qid=93c574ac0003de5a&p1=1
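
Once chardet is installed, detection could look roughly like the sketch below. It relies on chardet.detect(), which returns a dict containing an 'encoding' entry; the helper name decode_auto and the UTF-8 fallback are assumptions made for this example, not something from the original notes.

```python
try:
    import chardet  # third-party package; may not be installed
except ImportError:
    chardet = None


def decode_auto(raw, fallback="utf-8"):
    """Guess the encoding of raw bytes with chardet, then decode them."""
    encoding = None
    if chardet is not None:
        encoding = chardet.detect(raw)["encoding"]  # None if detection fails
    if encoding and encoding.lower() == "gb2312":
        encoding = "gbk"  # gbk is a superset of gb2312, so it decodes more pages
    try:
        return raw.decode(encoding or fallback, "ignore")
    except LookupError:
        # chardet reported a name that Python has no codec for
        return raw.decode(fallback, "ignore")


print(decode_auto(b"hello world"))
```

Detection is a statistical guess and gets less reliable on short inputs, so keeping 'ignore' and a fallback encoding is a cautious choice.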

 

posted @ 2016-07-07 11:38  IT小甲鱼